MSE MachLe -- Data import and data visualization

Helmut Grabner, Spring Term, 2021

Objectives:

Tasks:

Questions:

Hints:

(1) Edgar Anderson's Iris Data

This famous iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for flowers from each species of iris. The species are Iris setosa, versicolor, and virginica.
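A minimal sketch for getting the iris data into a pandas DataFrame (here via scikit-learn; seaborn ships the same data as well) and taking a first look at per-species statistics:

```python
# Load the iris measurements into a DataFrame and inspect per-species means.
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame  # 150 rows: four measurements plus the 'target' species column
print(df.groupby("target").mean())  # per-species means of the four measurements
```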

Add your interpretations & findings here:

...

(2) A more complex dataset

Choose another dataset which is more complex, i.e. has at least 100 observations and 8 attributes (e.g. sns.load_dataset("titanic")). Try to visualize your dataset and explain your observations.

Look at the hints to find new datasets.

Add your interpretations & findings here:

...

MSE MachLe -- Similarities, k-NN & Performance Measurements (SOLUTION)

Helmut Grabner, Spring Term, 2021

Objectives:

Questions:

Hints:

(1) Distances - Curse of dimensionality

Draw two random points in n dimensions, calculate their distance using the Euclidean norm, and repeat this 1000 times. Normalize by the maximum possible distance. For two dimensions (points in the unit square), the maximum distance is √2; how does this generalize to higher dimensions?

What do you observe? Interpret the results!

Repeat the same experiment, but use a different distance function, e.g., the L1 or L∞ norm. What changes?
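The experiment can be sketched as follows, assuming points drawn uniformly from the unit cube and normalizing by the maximal possible distance √n:

```python
# 1000 random point pairs per dimensionality; Euclidean distances
# normalized by the maximal possible distance sqrt(n) in the unit cube.
import numpy as np

rng = np.random.default_rng(0)
for n in [2, 10, 100, 1000]:
    p = rng.random((1000, n))          # first point of each pair
    q = rng.random((1000, n))          # second point of each pair
    d = np.linalg.norm(p - q, axis=1) / np.sqrt(n)   # normalized distances
    print(n, d.mean().round(3), d.std().round(3))
```

Note how the spread of the normalized distances shrinks as n grows: in high dimensions, all point pairs become almost equally far apart.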

Add your interpretations & findings here:

...

(2) Performance measurements - Breast cancer

Split the data into 70% training and 30% test data and perform classification using a k-NN classifier with k = 5. Calculate accuracy, recall, precision, and F1-score for your classifier and plot the confusion matrix.
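A minimal sketch using scikit-learn's built-in breast cancer data: 70/30 split, k-NN with k = 5, then the four metrics and the confusion matrix (printed here instead of plotted):

```python
# 70/30 split, k-NN (k=5), and the requested classification metrics.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import (accuracy_score, recall_score, precision_score,
                             f1_score, confusion_matrix)

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
y_pred = knn.predict(X_test)

print("accuracy :", accuracy_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("f1       :", f1_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))  # rows: true class, columns: predicted
```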

Add your interpretations & findings here:

...

(3) k-NN, Diabetes data

Fit a k-NN classifier to the given dataset containing data from diabetes patients. Choose as k the values 1, 5, and n (the number of training samples) and plot the corresponding confusion matrices.
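A sketch of the loop on synthetic data (the course's diabetes CSV is assumed elsewhere); note that with k = n, every test point has the same neighborhood, so k-NN degenerates to predicting the majority class of the training set:

```python
# k-NN with k = 1, 5, and n on a synthetic stand-in dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

n_train = len(X_train)
for k in [1, 5, n_train]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(f"k={k}:\n", confusion_matrix(y_test, knn.predict(X_test)))
```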

Add your interpretations & findings here:

...

MSE MachLe -- Hyperparameter Tuning (SOLUTION)

Helmut Grabner, Spring Term, 2021

Objectives:

Questions:

Hints:

(1) k-NN: what's the optimal k?

Following up on the example from last week, we aim to automatically determine the best configuration (i.e., the hyperparameter k). Loop from 1 to 100 to find the best k for k-NN. Split the dataset into 90% training and 10% test examples.

(1a) Validation-Set

Split your training data again into two sets: 90% used for training and 10% used for validation.
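The validation-set approach can be sketched as follows, using scikit-learn's breast cancer data as a stand-in for the course dataset: loop k from 1 to 100, pick the k with the best validation accuracy, and only then touch the test set.

```python
# Nested 90/10 splits, then a k-loop scored on the held-out validation set.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.1, random_state=0)

scores = {k: KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr).score(X_val, y_val)
          for k in range(1, 101)}
best_k = max(scores, key=scores.get)
print("best k:", best_k, "validation accuracy:", scores[best_k])
print("test accuracy:", KNeighborsClassifier(n_neighbors=best_k)
      .fit(X_train, y_train).score(X_test, y_test))
```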

(1b) Cross-Validation

Add your interpretations & findings here:

Compare the results from (1a) and (1b).

MSE MachLe -- Bagging and Boosting Examples

Helmut Grabner, Spring Term, 2021

Bagging example

Boosting example

MSE MachLe -- Trees and Forests (SOLUTION)

Helmut Grabner, Spring Term, 2021

Objectives:

Questions:

Hints:

(1) Decision Trees

Add your interpretations and findings here:

...

(2) Random Forest

Add your interpretations and findings here:

...

Add your interpretations and findings here:

...

(3) Sentiment analysis with IMDB reviews

In this notebook we work with the IMDb dataset, a binary sentiment analysis dataset consisting of 50,000 reviews from the Internet Movie Database (IMDb) labeled as positive (1) or negative (0).

A negative review has a score ≤ 4 out of 10, and a positive review has a score ≥ 7 out of 10. We will apply a very simple preprocessing to the text reviews and then train a baseline random forest on bag-of-words features.
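The pipeline can be sketched on a tiny toy corpus (the real notebook loads the 50,000 IMDb reviews): bag-of-words features from CountVectorizer, then a random forest baseline.

```python
# Bag-of-words features plus a random forest, on hypothetical toy reviews.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier

reviews = ["a wonderful, moving film", "great acting and a great story",
           "boring and far too long", "a terrible waste of time"]
labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

vec = CountVectorizer(lowercase=True, stop_words="english")
X = vec.fit_transform(reviews)              # sparse bag-of-words matrix
clf = RandomForestClassifier(random_state=0).fit(X, labels)

# Score a new, unseen review through the same vectorizer.
print(clf.predict(vec.transform(["what a great film"])))
```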

You can test the trained model on new reviews from the internet or by writing your own review for a movie you like or don't like.

Follow the instructions in the cells!

Add your interpretations and findings here:

...

Add your interpretations and findings here:

...

Feature Engineering and Machine Learning

Why feature engineer at all?

To extract more information from your data. For example, check out the 'Name' column:

Notice that this column contains strings (text) that contain titles such as 'Mr', 'Master' and 'Dona'. You can use regular expressions to extract the title (to learn more about regular expressions, check out my write-up of our last FB Live code along event):
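The extraction can be sketched like this, assuming the usual Titanic 'Name' format where the title is the word directly followed by a dot:

```python
# Pull the title out of the 'Name' column with a regular expression.
import pandas as pd

df = pd.DataFrame({"Name": ["Braund, Mr. Owen Harris",
                            "Cumings, Mrs. John Bradley",
                            "Palsson, Master. Gosta Leonard"]})
# Capture the word that is immediately followed by a dot.
df["Title"] = df["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)
print(df["Title"].tolist())
```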

Being cabinless may be important

Recap:

Up next: deal with missing values, bin your numerical data, and transform all features into numeric variables. Then you'll build your final model for this session.

If you're enjoying this session, retweet or share on FB now and follow us on Twitter: @hugobowne & @DataCamp.

Dealing with missing values

Bin numerical data

Create a new column: number of members in family onboard

Transform all variables into numerical variables

Recap:

Up next: It's time ... to build your final model!


Building models with our new dataset!

You're now going to build a decision tree on your brand new feature-engineered dataset. To choose your hyperparameter max_depth, you'll use a variation on train-test split called cross-validation.

We begin by splitting the dataset into 5 groups or folds. Then we hold out the first fold as a test set, fit our model on the remaining four folds, predict on the test set and compute the metric of interest. Next we hold out the second fold as our test set, fit on the remaining data, predict on the test set and compute the metric of interest. Then similarly with the third, fourth and fifth.

As a result we get five values of accuracy, from which we can compute statistics of interest, such as the median and/or mean and 95% confidence intervals.

We do this for each value of each hyperparameter that we're tuning and choose the set of hyperparameters that performs the best. This is called grid search.
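The grid search over folds described above is exactly what scikit-learn's GridSearchCV automates; a minimal sketch (on the iris data as a stand-in) tuning max_depth with 5-fold cross-validation:

```python
# 5-fold cross-validated grid search over max_depth for a decision tree.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    param_grid={"max_depth": [1, 2, 3, 4, 5]},
                    cv=5)                      # 5 folds, accuracy by default
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)     # best depth and its mean CV accuracy
```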

Accuracy = 78.9%.

Next steps

See if you can do some more feature engineering and try some new models out to improve on this score. I'll post all of this on github and on the DataCamp community and it would be great to see all of you improve on these models.

There's a lot more pre-processing that you'd like to learn about, such as scaling your data. You'll also find scikit-learn pipelines super useful. Check out our Supervised Learning with scikit-learn course and the scikit-learn docs for all of this and more.

MSE MachLe -- (Deep) Neural Networks (SOLUTION)

Helmut Grabner, Spring Term, 2021

Objectives:

Questions:

Hints:

(1) Neural Net with Fashion MNIST (1 point)

This notebook trains a neural network model to classify images of clothing, like sneakers and shirts. It's okay if you don't understand all the details in this notebook. It should only give you an idea of what else exists besides the classical statistical/machine learning methods.

This guide uses the Fashion MNIST dataset by Zalando which contains 70,000 grayscale images in 10 categories. The images show individual articles of clothing at low resolution (28 by 28 pixels).

Here, 60,000 images are used to train the network and 10,000 images to evaluate how accurately the network learned to classify images. You can access the Fashion MNIST directly from TensorFlow.

Train a Deep Neural Network

Here we start building the (deep) neural net; the first and the last layers are given. The model as it stands should already reach an accuracy of approximately 80%. The idea is that you change and/or add layers to the model to achieve a better accuracy.

See the hints for different neural network architectures/styles!

Among the possible layers you can add are:

It is your choice how deep or complex you want to build your neural network.

Hint: For a Conv2D model you have to reshape your data to (60000, 28, 28, 1) and (10000, 28, 28, 1), respectively, and remove the Flatten layer!

Describe your interpretations, observations and performance regarding your neural network in the last cell.

In the following cell, the model that you specified above will be compiled and fitted to your data.

If you want, you can adapt parameters such as the optimizer and/or the number of epochs; no worries if you don't get all the details.

The cell below visualizes your predictions together with the corresponding input images.

Add your interpretations and findings:

...

(2) Classical method with Fashion MNIST

You have seen the deep learning approach for the problem above. Try to predict the classes of the Fashion MNIST dataset with the classical machine learning methods you learned in the lecture.
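The classical route can be sketched as follows: flatten the images and fit, e.g., a k-NN classifier. Shown here on scikit-learn's small digits set as a stand-in, since Fashion MNIST itself is loaded through TensorFlow in this notebook:

```python
# Flattened images fed to a classical classifier (k-NN on the digits data).
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)        # images already flattened to 64 values
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```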

Machine Learning
MSE FTP MachLe
Christoph Würsch

ML07 A2 Simple SVM examples

MSE_FTP_MachLe, WÜRC

Execute the following examples to get a feeling for the SVM and how to use it with scikit-learn. We start with a random two-dimensional data set.

Creation of data

We create 2 times 100 data points as follows:

Training and prediction using a SVM

Execute the following code and adapt it to make predictions for a few other points. Does the result make sense?

Cross validation

Play around with the code below. Which parameter C gives the best leave-one-out cross-validation error?
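Leave-one-out selection of C can be sketched like this, on a small two-dimensional blob data set standing in for the notebook's random data:

```python
# Leave-one-out cross-validation accuracy of a linear SVM for several C values.
from sklearn.datasets import make_blobs
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.svm import SVC

X, y = make_blobs(n_samples=60, centers=2, cluster_std=2.0, random_state=0)
for C in [0.01, 0.1, 1, 10, 100]:
    scores = cross_val_score(SVC(kernel="linear", C=C), X, y, cv=LeaveOneOut())
    print(f"C={C}: LOO accuracy = {scores.mean():.3f}")
```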

Parameter Optimization

The following code is adapted from here (originally from the scikit-learn repos) and shows how to systematically perform a parameter optimization.

To do so, we split the data into a train and test set. First, we use the training set to find the parameters which give the best accuracy.

Finding the optimal parameter for the training set

We evaluate a linear and a RBF kernel with different parameters.

Evaluation of the optimal parameters on untouched test-set

We see that an SVM with a linear kernel is most appropriate. We now evaluate these parameters on the test set, which we did not touch so far. Since the test set was left untouched, this performance is a good proxy for new, unseen data (provided it comes from the same distribution).

Machine Learning
MSE FTP MachLe
Christoph Würsch

ML07: A3 Support Vector Machine: Pima Indians

MSE_FTP_MachLe, WÜRC

Pima-Indian dataset

The data set pima-indians-diabetes.csv contains medical indicators collected from 768 patients of the indigenous Pima tribe (Phoenix, AZ). The subjects were between 21 and 81 years old. The following characteristics were recorded:

NumTimesPrg: number of pregnancies
PlGlcConc: blood sugar 2 h after oral uptake of sugar (oGTT) (mg/dl)
BloodP: diastolic blood pressure (mm Hg)
SkinThick: thickness of the skin fold at the triceps (mm)
TwoHourSerIns: insulin concentration 2 h after the oGTT (μIU/ml)
BMI: body mass index (kg/m²)
DiPedFunc: hereditary predisposition
Age: age (years)
HasDiabetes: diagnosis of diabetes type II

The classification goal is to make the diagnosis of type II diabetes based on the factors, i.e. to make a prediction model for the variable y= HasDiabetes using a support vector machine (SVM) with a radial basis function kernel.

Pima Indians

Explorative Data Analysis (EDA)

Load the dataset pima-indians-diabetes.csv and create a pandas data frame from it. We take care that the column captions are imported correctly.

Display the first 5 entries of the dataset.

It is already noticeable here that the insulin values of some patients have the value '0'.

(a) Calculate the percentage of patients with diabetes and display summary statistics using df.describe()

We can calculate the percentage of patients with diabetes by taking the mean of the response column 'HasDiabetes'. Using df.describe(), we can print the main statistics of the dataset.

Some characteristics take the value 0, although this makes no medical sense. These are

For example, at least a quarter of the insulin values are missing. For these characteristics, we replace the 0 values with np.nan and then count again. For all the other characteristics we do not know whether there are any further missing values.

(b) Exclude samples with zero entries or missing values

Replace the 0 values with np.nan and print a statistical description of the dataset again using df.describe(). Then drop the np.nan values using df.dropna().
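A minimal sketch of the replacement on a toy frame; in the exercise, the medically-impossible-zero columns of the Pima data are treated the same way:

```python
# Mark zeros as missing in the affected columns, then drop incomplete rows.
import numpy as np
import pandas as pd

df = pd.DataFrame({"PlGlcConc": [148, 85, 0], "BMI": [33.6, 0, 23.3],
                   "HasDiabetes": [1, 0, 1]})
cols = ["PlGlcConc", "BMI"]               # columns where 0 means "missing"
df[cols] = df[cols].replace(0, np.nan)
print(df.describe())                       # counts now exclude the missing values
df = df.dropna()                           # drop rows with any remaining NaN
```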

(c) Plot a histogram of each feature and of the target using df.hist()

(d) Split the data in 80% training and 20% test data

Use train_test_split from sklearn.model_selection. If you feed a pandas.Dataframe as an input to the method, you will also get pandas.Dataframes as output for the training and test features. This is quite practical.

(e) Standardize the features using the StandardScaler from sklearn.preprocessing

Standardize the features using the StandardScaler from sklearn.preprocessing and display the histograms of the features again.
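The key point, sketched below on toy data, is to fit the scaler on the training data only and then apply the same transform to both splits:

```python
# Fit StandardScaler on the training data, transform both splits identically.
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
X_test = np.array([[2.5, 250.0]])

scaler = StandardScaler().fit(X_train)       # learns mean and std per feature
X_train_std = scaler.transform(X_train)      # now zero mean, unit variance
X_test_std = scaler.transform(X_test)        # same transform, no refitting
print(X_train_std.mean(axis=0), X_train_std.std(axis=0))
```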

(f) Train a support vector machine with a radial basis function kernel and determine the accuracy on the test data.

(g) Perform a cross-validated grid search (GridSearchCV) to find the best parameters for γ and C, and report the best parameters and score

(h) Plot the resulting decision boundary of the SVM classifier for different values of γ and C as a contour plot in 2D

Use the following features as the only features for the prediction such that we can display the decision boundary in a 2D plot.

Vary the C parameters in the following ranges:

You can use the helper function PlotDecisionBoundary(model, X2D, y) to plot the decision boundary and the margins of the classifier.

Machine Learning
MSE FTP MachLe
Christoph Würsch

ML07: A4 SVM with polynomial and linear kernel

MSE, FTP_MachLe, WÜRC

Whenever you have a model that is represented with inner products, you can plug in a kernel function. For instance, a linear kernel is the same as applying linear transformations to feature space. And, in this case, it’s the same as a support vector classifier, because the decision boundary is linear.

(a) Split the data in 70% training and 30% test data

using train_test_split from sklearn.model_selection.

(b) Plot the training data as scatter plot using different colors for both classes.

(c) Fit an SVM with a second-degree polynomial kernel using γ=0.1 and C=1

There’s a little space between the two groups of data points. But closer to the center, it’s not clear which data point belongs to which class. A quadratic curve might be a good candidate to separate these classes. So let’s fit an SVM with a second-degree polynomial kernel.

(d) Plot the margins and decision boundary of the classifier

(e) Vary the hyperparameter γ logarithmically in the range from 10⁻² to 10⁺² in 5 steps.

(f) Vary the hyperparameter C logarithmically in the range from 10⁰ to 10⁴ in 5 steps.

Corona Test: Application of Bayes' rule

MSE_FTP_MachLe

The following probabilities are given:

$$p(c) = 10^{-3} = 0.1\%, \qquad p(t \mid c) = \tfrac{9}{10} = 90\%, \qquad p(t \mid \bar{c}) = \tfrac{1}{100} = 1\%$$

What is the probability that you actually are infected, given a positive test result? We can use Bayes' rule to invert the conditionals:

$$\begin{align}
p(c \mid t) &= \frac{p(t \mid c)\,p(c)}{p(t)} = \frac{p(t \mid c)\,p(c)}{\sum_{c_i} p(t, c_i)} \tag{1} \\
&= \frac{p(t \mid c)\,p(c)}{\sum_{c_i} p(t \mid c_i)\,p(c_i)} = \frac{p(t \mid c)\,p(c)}{p(t \mid c)\,p(c) + p(t \mid \bar{c})\,p(\bar{c})} \tag{2} \\
&= \frac{p(t \mid c)\,p(c)}{p(t \mid c)\,p(c) + p(t \mid \bar{c})\,\left(1 - p(c)\right)} \tag{3}
\end{align}$$
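Plugging the given numbers into this formula can be done directly in Python:

```python
# Numerical evaluation of Bayes' rule with the probabilities given above.
p_c = 1e-3          # prior: probability of being infected
p_t_c = 0.90        # sensitivity: p(t | c)
p_t_notc = 0.01     # false positive rate: p(t | not c)

p_c_t = p_t_c * p_c / (p_t_c * p_c + p_t_notc * (1 - p_c))
print(f"p(c | t) = {p_c_t:.4f}")   # roughly 0.08: a positive test is weak evidence
```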

Machine Learning
MSE FTP MachLe
Christoph Würsch

The Naive Bayes Classifier

MSE FTP_MachLe, Christoph Würsch

Naive Bayes is a conditional probability model: given a problem instance to be classified, represented by a vector $x = (x_1, \ldots, x_n)$ of $n$ features (independent variables), it assigns to this instance the probabilities

$$p(C_k \mid x_1, \ldots, x_n)$$

for each of the $K$ possible outcomes or classes $C_k$.

The problem with the above formulation is that if the number of features n is large or if a feature can take on a large number of values, then basing such a model on probability tables is infeasible. We therefore reformulate the model to make it more tractable. Using Bayes' theorem, the conditional probability can be decomposed as:

$$p(C_k \mid x) = \frac{p(C_k)\, p(x \mid C_k)}{p(x)}$$

Using Bayesian probability terminology, the above equation can be written as

$$\text{posterior} = \frac{\text{prior} \times \text{likelihood}}{\text{evidence}}$$

In practice, there is interest only in the numerator of that fraction, because the denominator does not depend on $C$ and the values of the features $x_i$ are given, so that the denominator is effectively constant. The numerator is equivalent to the joint probability model

$p(C_k, x_1, \ldots, x_n)$, which can be rewritten as follows, using the chain rule for repeated applications of the definition of conditional probability:

$$\begin{aligned}
p(C_k, x_1, \ldots, x_n) &= p(x_1, \ldots, x_n, C_k) \\
&= p(x_1 \mid x_2, \ldots, x_n, C_k)\; p(x_2, \ldots, x_n, C_k) \\
&= p(x_1 \mid x_2, \ldots, x_n, C_k)\; p(x_2 \mid x_3, \ldots, x_n, C_k)\; p(x_3, \ldots, x_n, C_k) \\
&= \cdots \\
&= p(x_1 \mid x_2, \ldots, x_n, C_k)\; p(x_2 \mid x_3, \ldots, x_n, C_k) \cdots p(x_{n-1} \mid x_n, C_k)\; p(x_n \mid C_k)\; p(C_k)
\end{aligned}$$

Now the "naive" conditional independence assumptions come into play: assume that all features in $x$ are mutually independent, conditional on the category $C_k$. Under this assumption,

$$p(x_i \mid x_{i+1}, \ldots, x_n, C_k) = p(x_i \mid C_k)$$

Thus, the joint model can be expressed as

$$p(C_k \mid x_1, \ldots, x_n) \propto p(C_k, x_1, \ldots, x_n) = p(C_k)\; p(x_1 \mid C_k)\; p(x_2 \mid C_k)\; p(x_3 \mid C_k) \cdots = p(C_k) \prod_{i=1}^{n} p(x_i \mid C_k),$$

where $\propto$ denotes proportionality.

This means that under the above independence assumptions, the conditional distribution over the class variable $C$ is:

$$p(C_k \mid x_1, \ldots, x_n) = \frac{1}{Z}\; p(C_k) \prod_{i=1}^{n} p(x_i \mid C_k)$$

where the evidence $Z = p(x) = \sum_k p(C_k)\, p(x \mid C_k)$ is a scaling factor depending only on $x_1, \ldots, x_n$, that is, a constant if the values of the feature variables are known.

Source: https://en.wikipedia.org/wiki/Naive_Bayes_classifier

An introductory example: Mrs Marple plays Golf

Let's analyze the probability of Mrs Marple playing Golf on a given day. Mrs Marple has recorded her decisions on 14 different days. Depending on the temperature, the outlook, the humidity and the wind condition of the day, we want to predict whether she is going to play Golf or not.

The response $y$ is: Play Golf?

1. Calculation of the Prior

The prior probability is the probability of playing Golf at all. For this we only have to look at the last column.

$$p(y = \text{yes}) = \frac{9}{14}, \qquad p(y = \text{no}) = \frac{5}{14}$$

2. Calculation of the conditional probabilities p(Xi|y)

As a next step, we have to calculate all conditional probabilities, i.e. the probability of every feature $X_i$ given $y = \text{yes}$ and $y = \text{no}$. We can use a contingency table (pandas.crosstab) to calculate these probabilities.

From this table, we can calculate the following:

From this table, we can calculate the following conditionals:

From this table, we can calculate the following conditionals:

And finally for the last feature $X_4$, we get

3. Making predictions

What is now the decision to play Golf under the following conditions?

$$X = \{X_1, X_2, X_3, X_4\} = \{\text{hot, sunny, high, True}\}$$

It is not necessary to calculate the evidence $Z = p(X)$, because we can compare the following two numerators:

$$p(y = \text{yes} \mid X) \propto p(X_1 \mid y)\; p(X_2 \mid y)\; p(X_3 \mid y)\; p(X_4 \mid y)\; p(y)$$

and
$$p(y = \text{no} \mid X) \propto p(X_1 \mid \bar{y})\; p(X_2 \mid \bar{y})\; p(X_3 \mid \bar{y})\; p(X_4 \mid \bar{y})\; p(\bar{y})$$

By looking at our tables, we get for the probability of playing Golf, $y = \text{yes}$:

$$p(y = \text{yes} \mid X) = \frac{1}{Z}\; p(X_1 \mid y)\; p(X_2 \mid y)\; p(X_3 \mid y)\; p(X_4 \mid y)\; p(y) = \frac{1}{Z} \cdot \frac{2}{9} \cdot \frac{2}{9} \cdot \frac{3}{9} \cdot \frac{3}{9} \cdot \frac{9}{14} = \frac{2^2}{9^2} \cdot \frac{1}{14} \cdot \frac{1}{Z}$$

By looking at our tables, we get for the probability of not playing Golf, $y = \text{no}$:

$$p(y = \text{no} \mid X) = \frac{1}{Z}\; p(X_1 \mid \bar{y})\; p(X_2 \mid \bar{y})\; p(X_3 \mid \bar{y})\; p(X_4 \mid \bar{y})\; p(\bar{y}) = \frac{1}{Z} \cdot \frac{2}{5} \cdot \frac{2}{5} \cdot \frac{4}{5} \cdot \frac{3}{5} \cdot \frac{5}{14} = \frac{2^4 \cdot 3}{5^3} \cdot \frac{1}{14} \cdot \frac{1}{Z}$$

Now, we can compare these two probabilities and calculate the ratio:

$$\frac{p(y = \text{yes} \mid X = \{\text{hot, sunny, high, True}\})}{p(y = \text{no} \mid X = \{\text{hot, sunny, high, True}\})} = \frac{2^2 \cdot 5^3}{9^2 \cdot 2^4 \cdot 3} = \frac{125}{972} < 1$$
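The comparison can be checked numerically, using the conditional probabilities as read off the tables above:

```python
# Numerical check of the naive Bayes comparison for Mrs Marple's decision.
p_yes = (2/9) * (2/9) * (3/9) * (3/9) * (9/14)   # proportional to p(y=yes | X)
p_no = (2/5) * (2/5) * (4/5) * (3/5) * (5/14)    # proportional to p(y=no | X)

ratio = p_yes / p_no
print(ratio, ratio < 1)   # ratio below one: the "no" hypothesis wins
```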

4. What happens, if one of the counts is zero?

In this case, the posterior $p(y \mid X)$ is zero as well. To avoid this, additive smoothing (Laplace smoothing) is normally applied to the data.

$$p_i = \frac{n_i + 1}{n + k}$$

where $n_i$ is the number of observations with feature value $i$, $n$ is the total number of observations, and $k$ is the number of possible feature values.

Interested readers can have a look at: https://en.wikipedia.org/wiki/Additive_smoothing

Machine Learning
MSE FTP MachLe
Christoph Würsch

Bayesian Inference for a Gaussian

Extended Solution for A10, Series 8

Assume a Gaussian likelihood function of the following form with known variance $\sigma^2$.

$$p(X \mid \mu) = \prod_{n=1}^{N} p(x_n \mid \mu) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left\{ -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (x_n - \mu)^2 \right\}$$

In this case, the posterior probability distribution will again be a Gaussian distribution and has the same form as the prior. The prior is then called a conjugate prior for this likelihood.

Again we emphasize that the likelihood function $p(X \mid \mu)$ is not a probability distribution over $\mu$ and is not normalized. We see that the likelihood function takes the form of the exponential of a quadratic form in $\mu$. Thus if we choose a prior $p(\mu)$ given by a Gaussian, it will be a conjugate distribution for this likelihood function, because the corresponding posterior will be a product of two exponentials of quadratic functions of $\mu$ and hence will also be Gaussian. We therefore take our prior distribution to be

$$p(\mu) = \mathcal{N}(\mu \mid \mu_0, \sigma_0^2)$$

And the posterior distribution is given by:

$$p(\mu \mid X) \propto p(X \mid \mu)\; p(\mu)$$

We consider only the terms in the exponentials and neglect the normalization factor. In this case, we have:

$$\exp\left\{ -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (x_n - \mu)^2 \right\} \cdot \exp\left\{ -\frac{1}{2\sigma_0^2} (\mu - \mu_0)^2 \right\}$$

Now, we only look at the quadratic term Q. The posterior probability distribution is a function of μ. So we are interested only in the linear and quadratic terms in μ.

$$\begin{aligned}
Q &= -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (x_n - \mu)^2 - \frac{1}{2\sigma_0^2} (\mu - \mu_0)^2 \\
&= -\frac{1}{2\sigma^2} \sum_{n=1}^{N} \left( x_n^2 - 2 x_n \mu + \mu^2 \right) - \frac{1}{2\sigma_0^2} \left( \mu^2 - 2\mu\mu_0 + \mu_0^2 \right) \\
&= -\frac{1}{2\sigma^2} \left( \sum_{n=1}^{N} x_n^2 - 2\mu \sum_{n=1}^{N} x_n + N\mu^2 \right) - \frac{1}{2\sigma_0^2} \left( \mu^2 - 2\mu\mu_0 + \mu_0^2 \right) \\
&= -\mu^2 \left( \frac{1}{2\sigma_0^2} + \frac{N}{2\sigma^2} \right) + 2\mu \left( \frac{\sum_{n=1}^{N} x_n}{2\sigma^2} + \frac{\mu_0}{2\sigma_0^2} \right) + \cdots \\
&= -\frac{1}{2} \left( \frac{1}{\sigma_0^2} + \frac{N}{\sigma^2} \right) \left[ \mu - \frac{\frac{\sum_n x_n}{\sigma^2} + \frac{\mu_0}{\sigma_0^2}}{\frac{1}{\sigma_0^2} + \frac{N}{\sigma^2}} \right]^2 + \cdots
\end{aligned}$$

The term in front of the square bracket is the inverse variance:

$$\frac{1}{\sigma_N^2} = \frac{N}{\sigma^2} + \frac{1}{\sigma_0^2} \tag{4}$$

The shift term in the square bracket is the mean $\mu_N$:

$$\mu_N = \frac{\frac{\sum_n x_n}{\sigma^2} + \frac{\mu_0}{\sigma_0^2}}{\frac{1}{\sigma_0^2} + \frac{N}{\sigma^2}} = \frac{\sigma_0^2 \sum_n x_n + \sigma^2 \mu_0}{\sigma^2 + N\sigma_0^2} = \frac{\sigma^2}{\sigma^2 + N\sigma_0^2}\, \mu_0 + \frac{N\sigma_0^2}{\sigma^2 + N\sigma_0^2} \cdot \frac{1}{N} \sum_n x_n = \frac{\sigma^2}{\sigma^2 + N\sigma_0^2}\, \mu_0 + \frac{N\sigma_0^2}{\sigma^2 + N\sigma_0^2}\, \mu_{ML}$$

It is worth spending a moment studying the form of the posterior mean and variance. Using simple manipulation involving completing the square in the exponent, you could show that the posterior distribution is given by:

$$p(\mu \mid X) = \mathcal{N}(\mu \mid \mu_N, \sigma_N^2) \tag{5}$$
$$\frac{1}{\sigma_N^2} = \frac{N}{\sigma^2} + \frac{1}{\sigma_0^2} \tag{6}$$
$$\mu_N = \frac{\sigma^2}{N\sigma_0^2 + \sigma^2}\, \mu_0 + \frac{N\sigma_0^2}{N\sigma_0^2 + \sigma^2}\, \mu_{ML} \tag{7}$$

in which μML is the maximum likelihood solution for μ given by the sample mean

$$\mu_{ML} = \frac{1}{N} \sum_{n=1}^{N} x_n \tag{8}$$

Python example for a Bayesian Update

Let's do a concrete example using Python to demonstrate this. We choose a simple Gaussian prior for $\mu$ with $\mu_0 = 0$ and $\sigma_0^2 = 4$.

$$p(\mu) = \mathcal{N}(\mu \mid 0, 4) \tag{9}$$

(i) The prior $p(\mu)$: our belief about the mean of the data $x_i$

(ii) The distribution of the data xi

In our model we assume that the distribution of the data is also a Gaussian distribution. Therefore we can calculate the likelihood of the data $x_i$ given the model and estimate the maximum likelihood mean $\mu_{ML}$ and variance $\sigma_{ML}^2$.

Now, let's draw $N = 30$ samples from another Gaussian distribution that describes the distribution of the samples. These will be our measurement points $x_i$. We assume that the mean and the variance of the data distribution are given by $\mu = 2$ and $\sigma^2 = 1$.

We calculate the empirical mean and variance of the data xi.

(iii) Calculation of the posterior distribution

According to our calculations, we have for the posterior:

$$p(\mu \mid X) = \mathcal{N}(\mu \mid \mu_N, \sigma_N^2) \tag{10}$$
$$\frac{1}{\sigma_N^2} = \frac{N}{\sigma^2} + \frac{1}{\sigma_0^2} \tag{11}$$
$$\mu_N = \frac{\sigma^2}{N\sigma_0^2 + \sigma^2}\, \mu_0 + \frac{N\sigma_0^2}{N\sigma_0^2 + \sigma^2}\, \mu_{ML} \tag{12}$$

in which μML is the maximum likelihood solution for μ given by the sample mean

$$\mu_{ML} = \frac{1}{N} \sum_{n=1}^{N} x_n \tag{13}$$
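These update formulas can be sketched directly in numpy, with prior $\mathcal{N}(0, 4)$ and data drawn from $\mathcal{N}(2, 1)$ as described in the text:

```python
# Posterior mean and variance of mu for a growing number of observations N.
import numpy as np

mu0, sigma0_sq = 0.0, 4.0            # prior N(0, 4)
sigma_sq = 1.0                        # known data variance
rng = np.random.default_rng(0)
x = rng.normal(2.0, 1.0, size=30)     # measurement points x_i

for N in [1, 5, 30]:
    mu_ml = x[:N].mean()                                       # eq. (13)
    sigma_N_sq = 1.0 / (N / sigma_sq + 1.0 / sigma0_sq)        # eq. (11)
    mu_N = (sigma_sq * mu0 + N * sigma0_sq * mu_ml) / (N * sigma0_sq + sigma_sq)  # eq. (12)
    print(f"N={N:2d}: mu_N={mu_N:.3f}, sigma_N^2={sigma_N_sq:.4f}")
```

The posterior mean is a convex combination of the prior mean and the sample mean, and the posterior variance shrinks as N grows.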

First of all, we note that the mean of the posterior distribution $\mu_N$ given by the equation above is a compromise between the prior mean $\mu_0$ and the maximum likelihood solution $\mu_{ML}$. If the number of observed data points $N = 0$, this equation reduces to the prior mean, as expected. For $N \to \infty$, the posterior mean is given by the maximum likelihood solution.

Similarly, consider the result for the variance of the posterior distribution. We see that this is most naturally expressed in terms of the inverse variance, which is called the precision. Furthermore, the precisions are additive, so that the precision of the posterior is given by the precision of the prior plus one contribution of the data precision from each of the observed data points. As we increase the number of observed data points, the precision steadily increases, corresponding to a posterior distribution with steadily decreasing variance. With no observed data points, we have the prior variance, whereas as the number of data points $N \to \infty$, the variance $\sigma_N^2$ goes to zero and the posterior distribution becomes infinitely peaked around the maximum likelihood solution.

We therefore see that the maximum likelihood result of a point estimate for $\mu$ given by $\mu_{ML}$ is recovered precisely from the Bayesian formalism in the limit of an infinite number of observations. Note also that for finite $N$, if we take the limit $\sigma_0^2 \to \infty$, in which the prior has infinite variance, then the posterior mean reduces to the maximum likelihood result, while the posterior variance is given by $\sigma_N^2 = \sigma^2 / N$.

Now let's change the number of measurement points $N$ continuously and draw the different distributions:

What we observe is that the posterior distribution gets more and more peaked around the data distribution, and the estimate $\mu_N$ of the true mean gets very narrowly centered around the data mean. The more data we have, the less important our initial belief about the prior distribution becomes.

Machine Learning
MSE FTP MachLe
Christoph Würsch

Inspector Clouseau: Extended Solution

Let's first import pandas and define the quantities that are given by the exercise.

The nomenclature is as follows:

conditional probabilities:


$$p(K \mid B, M) = 0.1, \quad p(K \mid B, \bar{M}) = 0.6, \quad p(K \mid \bar{B}, M) = 0.2, \quad p(K \mid \bar{B}, \bar{M}) = 0.3$$

Using Bayes' theorem we can invert the conditional probabilities.

$$p(B \mid K) = \frac{p(K \mid B)\, p(B)}{p(K)}$$

To calculate the marginal $p(K)$, we have to marginalize over the states $b \in \mathrm{dom}(B)$ of the butler $B$.

$$p(B \mid K) = \frac{p(K \mid B)\, p(B)}{p(K)} = \frac{p(K \mid B)\, p(B)}{\sum_{b \in \mathrm{dom}(B)} p(K \mid b)\, p(b)}$$

Since the conditionals also depend on the state of the maid $M$, we have to marginalize over the states $m \in \mathrm{dom}(M)$ of the maid $M$ as well. If we knew the full joint probability distribution $p(k, b, m)$ for all states of $K$, $B$ and $M$, we could just marginalize over the joint probability distribution:

$$p(K) = \sum_{b \in \mathrm{dom}(B)} \left\{ \sum_{m \in \mathrm{dom}(M)} p(K, b, m) \right\}$$

But since we are only given the conditional probabilities, we have to use Bayes' theorem to calculate the joint probability distribution:

$$p(k, b, m) = p(k \mid b, m)\, p(b)\, p(m)$$

So we have to sum over the conditionals multiplied by the priors $p(b)$ and $p(m)$:

$$p(K) = \sum_{b \in \mathrm{dom}(B)} \left\{ \sum_{m \in \mathrm{dom}(M)} p(K \mid b, m)\, p(m) \right\} p(b)$$

Using this, we get:

$$p(B \mid K) = \frac{\left\{ \sum_{m \in \mathrm{dom}(M)} p(K \mid B, m)\, p(m) \right\} p(B)}{\sum_{b \in \mathrm{dom}(B)} \left\{ \sum_{m \in \mathrm{dom}(M)} p(K \mid b, m)\, p(m) \right\} p(b)}$$
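The marginalization can be checked numerically. The priors are not restated in this text; $p(B) = 0.6$ and $p(M) = 0.2$ below are the values used in the standard version of this exercise and should be replaced by the ones given in yours:

```python
# Numerical marginalization over the maid's states, then Bayes' rule for p(B|K).
p_B, p_M = 0.6, 0.2   # assumed priors (standard version of the exercise)
p_K = {  # p(K=True | B, M) for the four (butler, maid) states, as given above
    (True, True): 0.1, (True, False): 0.6,
    (False, True): 0.2, (False, False): 0.3,
}

def p_k_given_b(b):  # sum over the maid's states m
    return sum(p_K[(b, m)] * (p_M if m else 1 - p_M) for m in (True, False))

num = p_k_given_b(True) * p_B
den = num + p_k_given_b(False) * (1 - p_B)
print("p(B=True | K=True) =", num / den)
```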

Let's test whether we did all calculations right: The probabilities must sum up to one.

Now, we can easily calculate the marginals:

Marginal for the Knife = True: $p(K{=}\text{True}) = \sum_{b \in \mathrm{dom}(B)} \left\{ \sum_{m \in \mathrm{dom}(M)} p(K{=}\text{True}, b, m) \right\}$

Marginal for the Butler = True: $p(B{=}\text{True}) = \sum_{k \in \mathrm{dom}(K)} \left\{ \sum_{m \in \mathrm{dom}(M)} p(k, B{=}\text{True}, m) \right\}$

Marginal for the Maid = True: $p(M{=}\text{True}) = \sum_{k \in \mathrm{dom}(K)} \left\{ \sum_{b \in \mathrm{dom}(B)} p(k, b, M{=}\text{True}) \right\}$

Machine Learning
MSE FTP MachLe
Christoph Würsch

Bayesian Regression using pymc3

Motivation from Cam Davidson-Pilon

"The Bayesian method is the natural approach to inference, yet it is hidden from readers behind chapters of slow, mathematical analysis. The typical text on Bayesian inference involves two to three chapters on probability theory, then enters what Bayesian inference is. Unfortunately, due to mathematical intractability of most Bayesian models, the reader is only shown simple, artificial examples. [...]

If Bayesian inference is the destination, then mathematical analysis is a particular path towards it. On the other hand, computing power is cheap enough that we can afford to take an alternate route via probabilistic programming.

The latter path is much more useful, as it denies the necessity of mathematical intervention at each step, that is, we remove often-intractable mathematical analysis as a prerequisite to Bayesian inference. Simply put, this latter computational path proceeds via small intermediate jumps from beginning to end, whereas the first path proceeds by enormous leaps, often landing far away from our target. Furthermore, without a strong mathematical background, the analysis required by the first path cannot even take place."

References:

https://docs.pymc.io/
https://github.com/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers
https://camdavidsonpilon.github.io/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/
https://github.com/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/zipball/master

Hamiltonian Monte Carlo (HMC) is a Markov chain Monte Carlo (MCMC) algorithm that avoids the random walk behavior and sensitivity to correlated parameters that plague many MCMC methods by taking a series of steps informed by first-order gradient information. These features allow it to converge to high-dimensional target distributions much more quickly than simpler methods such as random-walk Metropolis or Gibbs sampling.

The No-U-Turn Sampler (NUTS) is an extension to HMC that eliminates the need to set a number of steps L. NUTS uses a recursive algorithm to build a set of likely candidate points that spans a wide swath of the target distribution, stopping automatically when it starts to double back and retrace its steps. Empirically, NUTS performs at least as efficiently as and sometimes more efficiently than a well tuned standard HMC method, without requiring user intervention or costly tuning runs.

Matthew D. Hoffman, Andrew Gelman: The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo. https://arxiv.org/abs/1111.4246

Machine Learning
MSE FTP MachLe
Christoph Würsch

Brownian Motion

MSE FTP_MachLe HS2020

Christoph Würsch, Institute for Computational Engineering, ICE, OST

In probability theory and related fields, a stochastic or random process is a mathematical object usually defined as a family of random variables. Many stochastic processes can be represented by time series.

A stochastic process (also called random process) is the mathematical description of temporally ordered, random processes. The theory of stochastic processes represents a significant extension of probability theory and forms the basis for stochastic analysis.

Time series and discrete stochastic processes [1]

Let $T$ be a set of equidistant time points $T = \{t_1, t_2, \ldots\}$.

  1. A discrete stochastic process is a set of random variables $\{X_1, X_2, \ldots\}$. Each single random variable $X_i$ has a univariate distribution function $F_i$ and can be observed at time $t_i$.
  2. A time series $\{x_1, x_2, \ldots\}$ is a realization of a discrete stochastic process $\{X_1, X_2, \ldots\}$. In other words, the value $x_i$ is a realization of the random variable $X_i$ measured at time $t_i$.

[1] Time Series Analysis and Its Applications by Robert H. Shumway and David S. Stoffer, Springer 2011.

Example: The Wiener process Wt is characterised by the following properties:

  1. $W_0 = 0$
  2. $W$ has independent increments: for every $t > 0$, the future increments $W_{t+u} - W_t$, $u \geq 0$, are independent of the past values $W_s$, $s \leq t$.
  3. $W$ has Gaussian increments: $W_{t+u} - W_t$ is normally distributed with mean 0 and variance $u$: $W_{t+u} - W_t \sim \mathcal{N}(0, u)$.
  4. $W$ has continuous paths: $W_t$ is continuous in $t$.

We simulate Brownian motions with 5000 time steps:

We simulate two independent one-dimensional Brownian processes to form a single two-dimensional Brownian process. The (discrete) Brownian motion makes independent Gaussian jumps at each time step. Therefore, we merely have to compute the cumulative sum of independent normal random variables (one for each time step):
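The simulation described above can be sketched as follows: each coordinate of the 2D path is the cumulative sum of independent Gaussian steps, one per time step.

```python
# A 2D Brownian path as the cumulative sum of independent N(0, dt) increments.
import numpy as np

n = 5000
rng = np.random.default_rng(0)
dt = 1.0 / n
# Prepend 0 so that both coordinates start at the origin (W_0 = 0).
x = np.concatenate([[0.0], np.cumsum(rng.normal(0, np.sqrt(dt), n))])
y = np.concatenate([[0.0], np.cumsum(rng.normal(0, np.sqrt(dt), n))])
print(x.shape, x[0], y[0])
```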

Now, to display the Brownian motion, we could just use plot(x, y). However, the result would be monochromatic and a bit boring. We would like to use a gradient of color to illustrate the progression of the motion in time (the hue is a function of time). matplotlib does not support this feature natively, so we use scatter() instead. This function allows us to assign a different color to each point, at the expense of dropping the line segments between points. To work around this issue, we linearly interpolate the process to give the illusion of a continuous line:

Machine Learning
MSE FTP MachLe
Christoph Würsch

Lab 9, A3: Illustration of prior and posterior Gaussian process for different kernels

Multivariate Gaussian distributions are useful for modeling finite collections of real-valued variables because of their nice analytical properties. Gaussian processes (GPs) are the extension of multivariate Gaussians to infinite-sized collections of real-valued variables. In particular, this extension allows us to think of Gaussian processes as distributions not just over random vectors but in fact distributions over random functions.

Unlike classical learning algorithms, Bayesian algorithms do not attempt to identify "best-fit" models of the data (or similarly, make "best guess" predictions for new test inputs). Instead, they compute a posterior distribution over models (or similarly, posterior predictive distributions for new test inputs). These distributions provide a useful way to quantify our uncertainty in model estimates, and to exploit our knowledge of this uncertainty in order to make more robust predictions on new test points.

Gaussian Process

A stochastic process is a collection of random variables, $\{f(x) : x \in \mathcal{X}\}$, indexed by elements from some set $\mathcal{X}$, known as the index set.

A Gaussian process is a stochastic process such that any finite subcollection of random variables has a multivariate Gaussian distribution. In particular, a collection of random variables $\{f(x) : x \in \mathcal{X}\}$ is said to be drawn from a Gaussian process $\mathcal{GP}$ with mean function $m(\cdot)$ and covariance function $k(\cdot,\cdot)$ if, for any finite set of elements $x_1, \dots, x_m \in \mathcal{X}$, the associated finite set of random variables $f(x_1), \dots, f(x_m)$ has a multivariate Gaussian distribution,

When we form a Gaussian process GP we assume data is jointly Gaussian with a particular mean and covariance,

$$p(f|X) \sim \mathcal{N}\left(m(X), K(X)\right), \qquad f \sim \mathcal{GP}\left(m(x), k(x, x')\right),$$

where the conditioning is on the inputs $X$, which are used for computing the mean and covariance. For this reason they are known as mean and covariance functions. To make things clearer, let us assume that the mean function $m(X)$ is zero, i.e. the jointly Gaussian distribution is centered around zero.

In this case, the Gaussian process perspective takes the marginal likelihood of the data to be a joint Gaussian density with a covariance given by $K$. So the model likelihood is of the form $$p(y|X) = \frac{1}{(2\pi)^{n/2}\,|K + \sigma^2 I|^{1/2}} \exp\left(-\frac{1}{2}\, y^\top \left(K + \sigma^2 I\right)^{-1} y\right)$$ where the input data, $X$, influences the density through the covariance matrix, $K$, whose elements are computed through the covariance function, $k(x, x')$.

This means that the negative log likelihood (the objective function) is given by $$E(\theta) = \frac{1}{2} \log\left|K + \sigma^2 I\right| + \frac{1}{2}\, y^\top \left(K + \sigma^2 I\right)^{-1} y$$ where the parameters of the model are embedded in the covariance function: they include the parameters of the kernel (such as lengthscale and variance) and the noise variance, $\sigma^2$. These parameters are called hyperparameters. Often, they are specified as logarithms and are then called log-hyperparameters.

Making Predictions - computing the posterior

We therefore have a probability density that represents functions. How do we make predictions with this density? The density is known as a process because it is consistent. By consistency we mean that the model makes predictions for $f$ that are unaffected by future values of $f_*$ that are currently unobserved (such as test points). If we think of $f_*$ as test points, we can still write down a joint probability density over the training observations, $f$, and the test observations, $f_*$. This joint probability density will be Gaussian, with a covariance matrix given by our covariance function, $k(x_i, x_j)$.

$$\begin{bmatrix} f \\ f_* \end{bmatrix} \sim \mathcal{N}\left(0, \begin{bmatrix} K & K_* \\ K_*^\top & K_{*,*} \end{bmatrix}\right)$$

where $K$ is the covariance computed between all the training points, $K_*$ is the covariance matrix computed between the training points and the test points, and $K_{*,*}$ is the covariance matrix computed between the test points and themselves.

Conditional Density

Just as in naive Bayes, we first define the joint density, $p(y, X)$, and now we need to define conditional distributions that answer particular questions of interest. In particular, we might be interested in the values of the prediction function at the test data given those at the training data, $p(f_*|f)$. Or, if we include noise in the training observations, we are interested in the conditional density for the prediction function at the test locations given the training observations, $p(f_*|y)$.

As ever all the various questions we could ask about this density can be answered using the sum rule and the product rule. For the multivariate normal density the mathematics involved is that of linear algebra, with a particular emphasis on the partitioned inverse or block matrix inverse, but they are beyond the scope of this course, so you don't need to worry about remembering them or rederiving them. We are simply writing them here because it is this conditional density that is necessary for making predictions.

The conditional density is also a multivariate normal, $p(f_*|y) \sim \mathcal{N}(\mu_f, C_f)$, with mean $$\mu_f = K_*^\top \left[K + \sigma^2 I\right]^{-1} y$$ and covariance $$C_f = K_{*,*} - K_*^\top \left[K + \sigma^2 I\right]^{-1} K_*.$$
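The conditional-density formulas translate almost line-by-line into numpy. A minimal sketch, assuming an RBF covariance and 1-D inputs (the function names are my own, not from the lab code):

```python
import numpy as np

def rbf(A, B, lam=1.0):
    """Squared-exponential covariance between two sets of 1-D inputs."""
    d = A[:, None] - B[None, :]
    return np.exp(-0.5 * (d / lam) ** 2)

def gp_posterior(X, y, Xs, lam=1.0, noise=0.1):
    """Posterior mean and covariance at test points Xs given noisy observations y."""
    K = rbf(X, X, lam)    # K   : train/train covariance
    Ks = rbf(X, Xs, lam)  # K*  : train/test covariance
    Kss = rbf(Xs, Xs, lam)  # K** : test/test covariance
    A = np.linalg.inv(K + noise ** 2 * np.eye(len(X)))
    mu = Ks.T @ A @ y           # mu*  = K*^T [K + s^2 I]^{-1} y
    cov = Kss - Ks.T @ A @ Ks   # C*   = K** - K*^T [K + s^2 I]^{-1} K*
    return mu, cov

X = np.array([-1.0, 0.0, 1.0])
y = np.sin(X)
mu, cov = gp_posterior(X, y, np.array([0.0]))
```

(For larger problems one would use a Cholesky solve instead of the explicit inverse, but the explicit form mirrors the equations above.)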

But now let's look at some frequently used covariance functions:

(a) Analyze the code

We first instantiate a list of kernels that we want to analyze. The following kernels (covariance functions) will be analyzed:

(b) Kernel functions

Have a look at the samples drawn from the given Gaussian processes for each kernel: RBF, Matern, RationalQuadratic, ExpSineSquared, DotProduct and ConstantKernel. Give an example for data that could be described by these covariance functions.

(i) The RBF kernel (also called Squared Exponential (SE) kernel)

The radial basis function kernel, or RBF kernel for short, is a stationary kernel. Stationary means that the kernel $k(x, x') = k(x - x')$ is invariant to translations. It is also known as the “squared exponential” kernel. It is parameterized by a length-scale parameter $\lambda$, which can either be a scalar (isotropic variant of the kernel) or a vector with the same number of dimensions as the inputs (anisotropic variant of the kernel). The kernel is given by:

$$k(x, x') = \sigma_0^2 \exp\left[-\frac{1}{2}\left(\frac{x - x'}{\lambda}\right)^2\right]$$

This kernel is infinitely differentiable, which implies that GPs with this kernel as covariance function have mean square derivatives of all orders, and are thus very smooth.

Example: We can use it to model and predict very smooth processes or signals.
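A short sketch of how such prior samples can be drawn with scikit-learn's kernel classes (an unfitted GaussianProcessRegressor samples from the prior; the length scale here is an arbitrary choice):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# A GP prior with an RBF covariance produces very smooth sample functions.
kernel = 1.0 * RBF(length_scale=1.0)
gp = GaussianProcessRegressor(kernel=kernel)

X = np.linspace(0, 5, 100)[:, None]
# Without calling fit(), sample_y draws from the prior.
samples = gp.sample_y(X, n_samples=3, random_state=0)
```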

(ii) the Rational-Quadratic kernel

The RationalQuadratic kernel can be seen as a scale mixture (an infinite sum) of RBF kernels with different characteristic length-scales. It is parameterized by a length-scale parameter λ and a scale mixture parameter α. Only the isotropic variant where λ is a scalar is supported at the moment. The kernel is given by:

$$k(x, x') = \left(1 + \frac{(x - x')^2}{2\alpha\lambda^2}\right)^{-\alpha}$$

(iii) the Exp-Sine-Squared kernel

The Exp-Sine-Squared kernel allows modeling periodic functions. It is parameterized by a length-scale parameter $\lambda$ and a periodicity parameter $p$. Only the isotropic variant, where $\lambda$ is a scalar, is supported at the moment. The kernel is given by:

$$k(x, x') = \exp\left[-\frac{2\sin^2\!\left(\pi |x - x'| / p\right)}{\lambda^2}\right]$$

Example: modeling periodic changes, e.g. daily temperature changes, seasonally changing quantities, such as the CO2 concentration.

(iv) the Matérn kernel

The Matern kernel is a stationary kernel and a generalization of the RBF kernel. It has an additional parameter ν which controls the smoothness of the resulting function. It is parameterized by a length-scale parameter λ , which can either be a scalar (isotropic variant of the kernel) or a vector with the same number of dimensions as the inputs (anisotropic variant of the kernel). The kernel is given by:

$$k(x_i, x_j) = \sigma^2 \frac{1}{\Gamma(\nu)\, 2^{\nu-1}} \left(\frac{\sqrt{2\nu}\,|x - x'|}{\lambda}\right)^{\!\nu} K_\nu\!\left(\frac{\sqrt{2\nu}\,|x - x'|}{\lambda}\right)$$

where Kν is a modified Bessel function.

Example: modeling stock prices, random walks, non-differentiable stochastic processes

(v) the Dot-Product kernel

The Dot-Product kernel is non-stationary and can be obtained from linear regression by putting $\mathcal{N}(0, 1)$ priors on the coefficients of $x_d\ (d = 1, \dots, D)$ and a prior of $\mathcal{N}(0, \sigma_0^2)$ on the bias. The Dot-Product kernel is invariant to a rotation of the coordinates about the origin, but not to translations. It is parameterized by a parameter $\sigma_0^2$. For $\sigma_0^2 = 0$, the kernel is called the homogeneous linear kernel; otherwise it is inhomogeneous. The kernel is given by:

$$k(x, x') = \sigma_0^2 + x \cdot x'$$

The Dot-Product kernel is commonly combined with exponentiation.

Example: polynomial regression

(c) How to model a periodic signal with noise using a GP

Kernel functions can be added and multiplied. To model a purely periodic process with white noise, we can construct a new kernel consisting of the Exp-Sine-Squared kernel plus the white-noise kernel. In order to allow the model to adapt to variations of the amplitude $A$ of the periodic signal, we modulate the periodic kernel with an RBF kernel.
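As a sketch of such a composite kernel with scikit-learn's kernel classes (the particular hyperparameter values below are placeholders, not the lab's settings):

```python
from sklearn.gaussian_process.kernels import (
    RBF, ExpSineSquared, WhiteKernel, ConstantKernel)

# Periodic component, modulated in amplitude by an RBF (multiplication),
# plus additive white noise (addition) -- kernels compose with * and +.
periodic = ExpSineSquared(length_scale=1.0, periodicity=1.0)
amplitude = ConstantKernel(1.0) * RBF(length_scale=10.0)
kernel = amplitude * periodic + WhiteKernel(noise_level=0.1)
```

Evaluating `kernel(X)` on a set of inputs returns the resulting covariance matrix, with the white-noise term appearing only on the diagonal.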

(d) Matérn kernel

The Matern kernel is a stationary kernel and a generalization of the RBF kernel. It has an additional parameter ν which controls the smoothness of the resulting function. It is parameterized by a length-scale parameter λ , which can either be a scalar (isotropic variant of the kernel) or a vector with the same number of dimensions as the inputs (anisotropic variant of the kernel). The kernel is given by:

$$k(x_i, x_j) = \sigma^2 \frac{1}{\Gamma(\nu)\, 2^{\nu-1}} \left(\frac{\sqrt{2\nu}\,|x - x'|}{\lambda}\right)^{\!\nu} K_\nu\!\left(\frac{\sqrt{2\nu}\,|x - x'|}{\lambda}\right)$$

where Kν is a modified Bessel function.

Example: modeling stock prices, random walks, non-differentiable stochastic processes

Summary

We close our inspection of our Gaussian processes by pointing out some reasons why Gaussian processes are an attractive model for use in regression problems and in some cases may be preferable to alternative models (such as linear and locally-weighted linear regression):

  1. As Bayesian methods, Gaussian process models allow one to quantify uncertainty in predictions resulting not just from intrinsic noise in the problem but also the errors in the parameter estimation procedure. Furthermore, many methods for model selection and hyperparameter selection in Bayesian methods are immediately applicable to Gaussian processes (though we did not address any of these advanced topics here).
  2. Like locally-weighted linear regression, Gaussian process regression is non-parametric and hence can model essentially arbitrary functions of the input points.
  3. Gaussian process regression models provide a natural way to introduce kernels into a regression modeling framework. By careful choice of kernels, Gaussian process regression models can sometimes take advantage of structure in the data.
  4. Gaussian process regression models, though perhaps somewhat tricky to understand conceptually, nonetheless lead to simple and straightforward linear algebra implementations.

References

[1] Carl E. Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006. Online: http://www.gaussianprocess.org/gpml/

[2] Chuong B. Do: Gaussian Processes, Stanford University (2007)

[3] Neil D. Lawrence, Nicolas Durande: GPy introduction covariance functions, Machine Learning Summer School, Sydney, Australia (2015)

Machine Learning
MSE FTP MachLe
Christoph Würsch

Lab 9, A4: Gaussian process regression (GPR) with noise-level estimation

This example illustrates that GPR with a sum-kernel including a WhiteKernel can estimate the noise level of data.

(a) Inspect and interpret the data using a plot

Have a look at the data and make a good guess for the kernel to be selected.

(b) Create a suitable kernel for the covariance function

The data shows a noisy, sine-like oscillation. So it makes sense to select, for the oscillating part of the covariance, either

  1. the sin-exponential kernel
  2. the RBF-kernel

For the noisy part, we choose a white-noise kernel (WhiteKernel). We start with the RBF kernel and additive white noise.
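A hedged sketch of this choice with scikit-learn, on synthetic sine-like data (the data generation and hyperparameters are stand-ins, not the lab's dataset):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, 40)[:, None]
y = 0.5 * np.sin(3 * X[:, 0]) + rng.normal(0, 0.2, 40)  # noisy sine-like data

# RBF for the smooth oscillation, WhiteKernel to absorb (and estimate) the noise.
kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=1.0)
gpr = GaussianProcessRegressor(kernel=kernel, random_state=0).fit(X, y)

# After fitting, the optimized noise level sits in the second summand (k2).
noise_est = gpr.kernel_.k2.noise_level
```

The fitted `noise_level` should land near the true noise variance (here $0.2^2 = 0.04$), which is exactly the "noise-level estimation" the exercise refers to.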

(c) using the ExpSineSquared kernel

References

[1] Jan Hendrik Metzen jhm@informatik.uni-bremen.de

Machine Learning
MSE FTP MachLe
Christoph Würsch

Simple Gaussian Process

MSE FTP_MachLe HS2020 Christoph Würsch, Institute for Computational Engineering ICE, OST

A Gaussian process is a stochastic process such that any finite subcollection of random variables has a multivariate Gaussian distribution. In particular, a collection of random variables $\{f(x) : x \in \mathcal{X}\}$ is said to be drawn from a Gaussian process $\mathcal{GP}$ with mean function $m(\cdot)$ and covariance function $k(\cdot,\cdot)$ if, for any finite set of elements $x_1, \dots, x_m \in \mathcal{X}$, the associated finite set of random variables $f(x_1), \dots, f(x_m)$ has a multivariate Gaussian distribution,

When we form a Gaussian process GP we assume data is jointly Gaussian with a particular mean and covariance,

$$p(f|X) \sim \mathcal{N}\left(m(X), K(X)\right), \qquad f \sim \mathcal{GP}\left(m(x), k(x, x')\right),$$

where the conditioning is on the inputs $X$, which are used for computing the mean and covariance. For this reason they are known as mean and covariance functions. To make things clearer, let us assume that the mean function $m(X)$ is zero, i.e. the jointly Gaussian distribution is centered around zero.

Generate random x-values:

Calculate the distance matrix D :

$$D_{ij} = \|x_i - x_j\|_p^p$$

Calculate the mean vector $m$ and the covariance matrix $K_{ij} = d_{ij}^p$
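The three steps can be sketched in numpy. Note that the covariance construction $K = e^{-D}$ below is an assumption on my part (a squared-exponential kernel for $p = 2$), since the notebook's exact formula is not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = np.sort(rng.uniform(0, 5, n))  # random x-values

# Distance matrix D_ij = |x_i - x_j|^p with p = 2
p = 2
D = np.abs(x[:, None] - x[None, :]) ** p

# Zero mean vector and covariance K = exp(-D) (assumed kernel);
# a small jitter on the diagonal keeps K numerically positive definite.
m = np.zeros(n)
K = np.exp(-D) + 1e-10 * np.eye(n)

# One sample path of the Gaussian process
f = rng.multivariate_normal(m, K)
```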

Machine Learning
MSE FTP MachLe
Christoph Würsch

Lab10 A2 Value Iteration GridWorld

Reinforcement Learning — Implement Grid World

Introduction of Value Iteration

Based on a code by Jeremy Zhang

Gridworld is the most basic as well as classic problem in reinforcement learning, and implementing it on your own is, I believe, the best way to understand the basics of reinforcement learning.

The rule is simple.

Grid Board

drawing

When our agent takes an action, the Environment should have a function that accepts an action and returns a legal position for the next state.

Agent and Value Iteration

This is the artificial intelligence part, as our agent should be able to learn from the process and think like a human. The key to the magic is value iteration.

Value Iteration

What our agent will finally learn is a policy $\pi$, and a policy is a mapping from state $s$ to action $a$ that simply instructs the agent what to do at each state. In our case, instead of learning a mapping from state to action, we will leverage value iteration to learn a mapping from state to value (the estimated reward); based on this estimate, at each state our agent will choose the action that gives the highest estimated reward. There is not going to be any cranky, head-scratching math involved, as the core of value iteration is amazingly concise.

$$V(s_t) \leftarrow V(s_t) + \alpha\left[R + V(s_{t+1}) - V(s_t)\right]$$

This is the essence of value iteration. This formula almost applies to all reinforcement learning problems. Value iteration, just as its name, updates its value (estimated reward) at each iteration (end of game).

At first, our agent knows nothing about the grid world (environment), so it simply initialises all rewards to 0. Then it starts to explore the world by randomly walking around; surely it will endure lots of failure at the beginning, but that is totally fine. Once it reaches the end of the game, with either reward +1 or reward -1, the whole game resets, the reward propagates backward, and eventually the estimated values of all states along the way are updated based on the formula above.

The $V(s_t)$ on the left is the updated value of that state, the one on the right is the current, non-updated value, and $\alpha$ is the learning rate. The formula simply says that the updated value of a state equals the current value plus a temporal difference: what the agent learned from this iteration of game playing minus the previous estimate.
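A minimal sketch of this backward propagation at the end of one game, roughly in the spirit of the referenced implementation (the function and state names are illustrative, not the notebook's API):

```python
def backpropagate(values, episode_states, final_reward, alpha=0.2):
    """Apply V(s) <- V(s) + alpha * (target - V(s)) backward along one episode.

    values: dict mapping state -> estimated value (missing states count as 0).
    episode_states: states visited in order; final_reward: +1 or -1 at the end.
    """
    target = final_reward
    for state in reversed(episode_states):
        v = values.get(state, 0.0)
        values[state] = v + alpha * (target - v)
        target = values[state]  # the earlier state bootstraps from this update
    return values

# One finished game that ended with reward +1:
values = backpropagate({}, [(0, 0), (0, 1), (0, 2)], final_reward=1.0)
```

Repeated over many games, the estimates along frequently successful paths grow toward the terminal reward.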

(f) Plots of the norm of the value function $\|V(s)\|_2$

Machine Learning
MSE FTP MachLe
Christoph Würsch

Lab10 A3 Q-Learning in the Grid-World

(Lab10_A3_QLearning_Gridboard_SOLUTION.ipynb)

Adapted from Jeremy Zhang:

Whereas $V(s)$ is a mapping from state $s$ to the estimated value of that state, the Q-function $Q(s,a)$ differs from the V-function by only one component. Instead of thinking that you receive a value when you are in a specific state $s$, think one step forward: you are in a state $s$, and by taking a specific action $a$ you receive a corresponding value.

The result of grid world using value iteration was that we get an estimated value $V(s)$ for each state, but to obtain our policy $\pi(s,a)$, which is a mapping from state $s$ to action $a$, we need to go one step further by choosing the action that leads to the next state with the maximum value. In the Q-function, however, state and action are paired in the first place, which means that once one has the optimal Q-function, one has the optimal action for each state.

Besides the Q-function, we are going to add more fun to our game:

Non-deterministic means that the agent will not always be able to go where it intends to go. When it takes an action, there is a probability that it ends up performing a different action.

drawing


When the agent moves up, it has probability 0.8 of moving up and probability 0.1 each of going left or right. (non-deterministic)

Q-learning algorithm

One of the early breakthroughs in reinforcement learning was the development of an off-policy TD control algorithm known as Q-learning (Watkins, 1989), defined by

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left[R_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t)\right]$$

The Q-learning algorithm is intrinsically off-policy, since it does not draw samples from a given policy $\pi$.

  1. Perform an action $a$ in a state $s$ to end up in $s'$ with reward $r$, i.e. consider the tuple $(s, a, s', r)$
  2. Compute the intermediate Q-value: $\hat{Q}(s,a) = R(s,a,s') + \gamma \max_{a'} Q_k(s', a')$
  3. Incorporate that new evidence into the existing value using a running average ($\alpha$ is the learning rate): $Q_{k+1}(s,a) = (1-\alpha)\, Q_k(s,a) + \alpha\, \hat{Q}(s,a)$

This can be rewritten in a gradient-descent update rule fashion as:

$$Q_{k+1}(s,a) = Q_k(s,a) + \alpha\left(\hat{Q}(s,a) - Q_k(s,a)\right)$$
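The three steps above can be sketched as a single tabular update (names are illustrative, not the notebook's API):

```python
def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One Q-learning step: Q(s,a) <- (1-alpha)*Q(s,a) + alpha*[r + gamma*max_a' Q(s',a')].

    Q is a dict mapping (state, action) -> value; missing entries count as 0.
    """
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    q_hat = r + gamma * best_next                 # intermediate Q-value (new evidence)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = (1 - alpha) * old + alpha * q_hat  # running average
    return Q

Q = q_update({}, s=(0, 0), a="up", r=1.0, s_next=(0, 1),
             actions=["up", "down", "left", "right"])
```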

Temporal Difference Learning

Action

Machine Learning
MSE FTP MachLe
Christoph Würsch

Tic Tac Toe

Two players against each other

drawing

Board State


Reflect & Judge the state

2 players p1 and p2; p1 uses symbol 1 and p2 uses symbol 2, vacancy as 0

Training

Human vs Computer

Machine Learning
MSE FTP MachLe
Christoph Würsch

The Curse of Dimensionality

In machine learning, the "curse of dimensionality" is often stated but much less often explained, at least not in detail. One just gets told that points are farther away from each other in high-dimensional spaces.

1. Maximum minimal distance

One approach to this is to calculate the maximum minimal distance of $k$ points in $[0,1]^n$. So you try to place $k$ points in such a way that the minimum over the pairwise distances of those $k$ points is maximal. Let's call this $\alpha(n,k)$. However, it is not easily possible to calculate $\alpha(n,k)$ for arbitrary $n > 2$.

But for the special case $k = 2$ and $n = 2$ it is easy:

So you can see that two points can be farther apart in higher dimensions, and that it takes many more points in higher dimensions to force at least two of them to have distance 1.

2. Average distance

Another approach is to calculate the average Euclidean distance of $k$ uniformly randomly sampled points in $[0,1]^n$. Let's call it $\beta(n,k)$.

One first insight is that $\beta(n,k) = \beta(n,j)$ for $k, j \geq 2$. Hence, we will only use $\beta(n)$ in the following. It is possible to calculate this, but it is rather tedious (see link).

A calculated solution for $k = 2$ in $n = 2$ dimensions:

We have to compute: $$I = \int_{[0,1]^4} \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}\; d\mu. \tag{1}$$ Assuming that $X_1$ and $X_2$ are two independent random variables, uniformly distributed over $[0,1]$, the pdf of their difference $\Delta X = X_1 - X_2$ is the triangular density $p_{\Delta X}(x) = 1 - |x|$ on $[-1,1]$; hence:

$$I = \int_{[-1,1]^2} (1 - |x|)(1 - |y|)\sqrt{x^2 + y^2}\; dx\, dy = 4 \int_{[0,1]^2} (1 - x)(1 - y)\sqrt{x^2 + y^2}\; dx\, dy$$

which is tedious to compute but still possible; we have:

$$I = \frac{2 + \sqrt{2} + 5\,\operatorname{arcsinh}(1)}{15} = \frac{2 + \sqrt{2} + 5 \log(1 + \sqrt{2})}{15} \approx 0.52140543316472$$

In general, this distance is very hard to compute, see MathWorld.

3. A simple upper bound for the average Euclidian distance

A simple argument via Jensen's inequality gives an upper bound. We first note that if $X = (X_1, \dots, X_n)$ is uniform on $[0,1]^n$, then the marginals $X_i \in [0,1]$ are independent and uniform. Then

$$\beta(n) = E\left[\|X - Y\|\right] = E\left[\left(\sum_{i=1}^n (X_i - Y_i)^2\right)^{1/2}\right] \leq \left(E\left[\sum_{i=1}^n (X_i - Y_i)^2\right]\right)^{1/2} = \left(n\, E\left[(X_1 - Y_1)^2\right]\right)^{1/2} = \sqrt{\frac{n}{6}}$$

This is because the expected squared distance for $n = 1$ is given by:

$$E[\Delta x^2] = \int_0^1\!\!\int_0^1 (x - x')^2\, dx\, dx' = \int_0^1\!\!\int_0^1 \left(x^2 - 2xx' + x'^2\right) dx\, dx' = \frac{1}{6}$$
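A quick Monte Carlo check of both the exact value for $n = 2$ and the $\sqrt{n/6}$ upper bound:

```python
import numpy as np

rng = np.random.default_rng(0)

def avg_distance(n, samples=20000):
    """Monte Carlo estimate of the mean Euclidean distance between point pairs in [0,1]^n."""
    x = rng.uniform(size=(samples, n))
    y = rng.uniform(size=(samples, n))
    return np.linalg.norm(x - y, axis=1).mean()

beta2 = avg_distance(2)      # exact value is ~0.52140543
bound2 = np.sqrt(2 / 6)      # Jensen upper bound sqrt(n/6) ~ 0.5774
```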

4. Density of Hypercubes

One interesting question is how much of the $n$-dimensional hypercube can be filled by one inscribed $n$-dimensional hyperball. The volume of an $n$-dimensional hypercube is $V_C(a) = a^n$, where $a$ is the cube's side length. So for one dimension it is $a$, for two dimensions (a square) it is $a^2$, and for $n = 3$ it is the volume of a cube, $a^3$.

The volume of an n-dimensional ball is given by:

$$V_S(r) = \frac{r^n \pi^{n/2}}{\Gamma\!\left(\frac{n}{2} + 1\right)}$$

$$n = 1:\ \frac{r\sqrt{\pi}}{\Gamma(1.5)} = \frac{r\sqrt{\pi}}{0.5\,\Gamma(0.5)} = 2r \qquad n = 2:\ \frac{r^2 \pi}{\Gamma(2)} = r^2 \pi \qquad n = 3:\ \frac{r^3 \pi^{3/2}}{\Gamma\!\left(\frac{5}{2}\right)} = \frac{r^3 \pi^{3/2}}{1.5 \cdot 0.5\,\Gamma\!\left(\frac{1}{2}\right)} = \frac{4\pi r^3}{3}$$

This means the percentage of space of a unit hypercube which can be filled by the biggest inscribed hyperball is:

$$\frac{V_S(0.5)}{V_C(1)} = \frac{0.5^n\, \pi^{n/2}}{\Gamma\!\left(\frac{n}{2} + 1\right)} = \frac{\pi^{n/2}}{2^n\, \Gamma\!\left(\frac{n}{2} + 1\right)} \;\longrightarrow\; 0 \quad (n \to \infty)$$

You can see that this term goes to zero with increasing dimension. This means most of the volume is not in the center but in the corners of the $n$-dimensional hypercube. It also means that $k$-nearest-neighbors with the Euclidean distance measure will need enormously large spheres to reach the next neighbours.
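The ratio can be evaluated directly with math.gamma:

```python
from math import pi, gamma

def ball_to_cube_ratio(n):
    """Fraction of the unit hypercube filled by the inscribed hyperball (radius 1/2)."""
    return (0.5 ** n) * pi ** (n / 2) / gamma(n / 2 + 1)

# The ratio shrinks rapidly: 1.0, ~0.785, ~0.524, ~0.0025, ~2.5e-8
ratios = [ball_to_cube_ratio(n) for n in (1, 2, 3, 10, 20)]
```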

5. Average Angle

One interesting question is how the average angle between two points (and the origin) changes with higher dimensions. Suppose all points are in the [0,1]n hypercube.

I thought about this for a while and came to the conclusion that it should be 90° on average due to symmetry, no matter how high the dimension is, because the higher the dimension $n$ becomes, the closer the data is concentrated to the axes.

A short experiment confirms that:
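A sketch of such an experiment; note that the angles below are measured at the cube's center (i.e. after subtracting 0.5 from each coordinate), which is the setting where the symmetry argument applies:

```python
import numpy as np

rng = np.random.default_rng(0)
n, pairs = 100, 5000

# Point pairs in [0,1]^n, re-centered so vectors emanate from the cube's center.
u = rng.uniform(size=(pairs, n)) - 0.5
v = rng.uniform(size=(pairs, n)) - 0.5

cos = np.sum(u * v, axis=1) / (np.linalg.norm(u, axis=1) * np.linalg.norm(v, axis=1))
angles = np.degrees(np.arccos(np.clip(cos, -1, 1)))
mean_angle = angles.mean()  # close to 90 degrees
```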

Sebastian Raschka
last updated: 02/07/2017

Stepping through a Principal Component Analysis

(using Python's numpy and matplotlib)

Sections


Introduction

The main purposes of a principal component analysis are the analysis of data to identify patterns, and the reduction of the dimensionality of the dataset with minimal loss of information.

Here, our desired outcome of the principal component analysis is to project a feature space (our dataset consisting of n d-dimensional samples) onto a smaller subspace that represents our data "well". A possible application would be a pattern classification task, where we want to reduce the computational costs and the error of parameter estimation by reducing the number of dimensions of our feature space by extracting a subspace that describes our data "best".

Principal Component Analysis (PCA) Vs. Multiple Discriminant Analysis (MDA)

Both Multiple Discriminant Analysis (MDA) and Principal Component Analysis (PCA) are linear transformation methods and closely related to each other. In PCA, we are interested in finding the directions (components) that maximize the variance in our dataset, whereas in MDA, we are additionally interested in finding the directions that maximize the separation (or discrimination) between different classes (for example, in pattern classification problems where our dataset consists of multiple classes; in contrast, PCA ignores the class labels).
In other words, via PCA we are projecting the entire set of data (without class labels) onto a different subspace, and in MDA we are trying to determine a suitable subspace to distinguish between patterns that belong to different classes. Roughly speaking, in PCA we are trying to find the axes with maximum variance where the data is most spread out (within one class, since PCA treats the whole data set as one class), and in MDA we are additionally maximizing the spread between classes.
In typical pattern recognition problems, a PCA is often followed by an MDA.

What is a "good" subspace?

Let's assume that our goal is to reduce the dimensions of a $d$-dimensional dataset by projecting it onto a $k$-dimensional subspace (where $k < d$). So, how do we know what size we should choose for $k$, and how do we know if we have a feature space that represents our data "well"?
Later, we will compute eigenvectors (the components) from our data set and collect them in a so-called scatter matrix (or alternatively calculate them from the covariance matrix). Each of those eigenvectors is associated with an eigenvalue, which tells us about the "length" or "magnitude" of the eigenvector. If we observe that all the eigenvalues are of very similar magnitude, this is a good indicator that our data is already in a "good" subspace. If some of the eigenvalues are much higher than others, we might be interested in keeping only those eigenvectors with the much larger eigenvalues, since they contain more information about our data distribution. Vice versa, eigenvalues that are close to 0 are less informative, and we might consider dropping those when we construct the new feature subspace.

Summarizing the PCA approach

Listed below are the 6 general steps for performing a principal component analysis, which we will investigate in the following sections.

  1. Take the whole dataset consisting of d-dimensional samples ignoring the class labels
  2. Compute the d-dimensional mean vector (i.e., the means for every dimension of the whole dataset)
  3. Compute the scatter matrix (alternatively, the covariance matrix) of the whole data set
  4. Compute the eigenvectors ($e_1, e_2, \dots, e_d$) and corresponding eigenvalues ($\lambda_1, \lambda_2, \dots, \lambda_d$)
  5. Sort the eigenvectors by decreasing eigenvalues and choose the $k$ eigenvectors with the largest eigenvalues to form a $d \times k$-dimensional matrix $W$ (where every column represents an eigenvector)
  6. Use this $d \times k$ eigenvector matrix to transform the samples onto the new subspace. This can be summarized by the mathematical equation $y = W^T \times x$ (where $x$ is a $d \times 1$-dimensional vector representing one sample, and $y$ is the transformed $k \times 1$-dimensional sample in the new subspace.)
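The six steps can be sketched end-to-end in numpy (with synthetic two-class Gaussian data standing in for the samples generated in the example that follows):

```python
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.multivariate_normal([0, 0, 0], np.eye(3), 20).T  # class 1: 3x20
X2 = rng.multivariate_normal([1, 1, 1], np.eye(3), 20).T  # class 2: 3x20
data = np.concatenate([X1, X2], axis=1)                   # step 1: 3x40, labels ignored

mean = data.mean(axis=1, keepdims=True)                   # step 2: d-dimensional mean vector
S = (data - mean) @ (data - mean).T                       # step 3: scatter matrix
eigvals, eigvecs = np.linalg.eigh(S)                      # step 4: eigenvalues/eigenvectors

order = np.argsort(eigvals)[::-1]                         # step 5: sort descending, keep k = 2
W = eigvecs[:, order[:2]]                                 # d x k matrix of top eigenvectors

transformed = W.T @ data                                  # step 6: y = W^T x  (k x 40)
```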



Generating some 3-dimensional sample data

For the following example, we will generate 40 3-dimensional samples randomly drawn from a multivariate Gaussian distribution.
Here, we will assume that the samples stem from two different classes, where one half (i.e., 20) of the samples are labeled $\omega_1$ (class 1) and the other half $\omega_2$ (class 2).

$$\mu_1 = \begin{bmatrix}0\\0\\0\end{bmatrix}, \quad \mu_2 = \begin{bmatrix}1\\1\\1\end{bmatrix} \quad \text{(sample means)}$$

$$\Sigma_1 = \begin{bmatrix}1&0&0\\0&1&0\\0&0&1\end{bmatrix}, \quad \Sigma_2 = \begin{bmatrix}1&0&0\\0&1&0\\0&0&1\end{bmatrix} \quad \text{(covariance matrices)}$$

Why are we chosing a 3-dimensional sample?

The problem of multi-dimensional data is its visualization, which would make it quite tough to follow our example principal component analysis (at least visually). We could also choose a 2-dimensional sample data set for the following examples, but since the goal of the PCA in a "Dimensionality Reduction" application is to drop at least one of the dimensions, I find it more intuitive and visually appealing to start with a 3-dimensional dataset that we reduce to a 2-dimensional dataset by dropping 1 dimension.

Using the code above, we created two 3×20 datasets - one dataset for each class $\omega_1$ and $\omega_2$ -
where each column can be pictured as a 3-dimensional vector $x = \begin{pmatrix}x_1\\x_2\\x_3\end{pmatrix}$, so that each class dataset has the form
$$X = \begin{pmatrix} x_{1,1} & x_{1,2} & \dots & x_{1,20}\\ x_{2,1} & x_{2,2} & \dots & x_{2,20}\\ x_{3,1} & x_{3,2} & \dots & x_{3,20} \end{pmatrix}$$

Just to get a rough idea how the samples of our two classes ω1 and ω2 are distributed, let us plot them in a 3D scatter plot.



1. Taking the whole dataset ignoring the class labels

Because we don't need class labels for the PCA analysis, let us merge the samples for our 2 classes into one 3×40-dimensional array.



2. Computing the d-dimensional mean vector



3. a) Computing the Scatter Matrix S

The scatter matrix is computed by the following equation:
$$S = \sum_{k=1}^{n} (x_k - \mu)(x_k - \mu)^T$$
where $\mu$ is the mean vector
$$\mu = \frac{1}{n} \sum_{k=1}^{n} x_k$$



3. b) Computing the Covariance Matrix $\Sigma = C = \frac{1}{N-1} S$

(alternatively to the scatter matrix)

Alternatively, instead of calculating the scatter matrix, we could also calculate the covariance matrix using the built-in numpy.cov() function. The equations for the covariance matrix and the scatter matrix are very similar; the only difference is that we use the scaling factor $\frac{1}{N-1}$ (here: $\frac{1}{40-1} = \frac{1}{39}$) for the covariance matrix. Thus, their eigenspaces will be identical (identical eigenvectors; only the eigenvalues are scaled differently, by a constant factor).

$$\Sigma_i = \begin{bmatrix} \sigma_{11}^2 & \sigma_{12}^2 & \sigma_{13}^2\\ \sigma_{21}^2 & \sigma_{22}^2 & \sigma_{23}^2\\ \sigma_{31}^2 & \sigma_{32}^2 & \sigma_{33}^2 \end{bmatrix}$$

Scaled Scatter Matrix S



4. Computing eigenvectors and corresponding eigenvalues

To show that the eigenvectors are indeed identical whether we derive them from the scatter or the covariance matrix, let us put an assert statement into the code. We will also see that the eigenvalues are indeed scaled by the factor 39 when we derive them from the scatter matrix.
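A sketch of that check (with random data standing in for the example's samples):

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(size=(3, 40))                 # 3 features, N = 40 samples

mean = data.mean(axis=1, keepdims=True)
S = (data - mean) @ (data - mean).T             # scatter matrix
C = np.cov(data)                                # covariance matrix = S / (N - 1)

eig_S, vec_S = np.linalg.eigh(S)
eig_C, vec_C = np.linalg.eigh(C)
```

The eigenvalues from the scatter matrix are exactly 39 times those from the covariance matrix, while the eigenvectors agree up to sign.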

Checking the eigenvector-eigenvalue calculation

Let us quickly check that the eigenvector-eigenvalue calculation is correct and satisfies the equation

$$\Sigma v = \lambda v$$

where
$\Sigma$ = covariance matrix, $v$ = eigenvector, $\lambda$ = eigenvalue.

Visualizing the eigenvectors

And before we move on to the next step, just to satisfy our own curiosity, we plot the eigenvectors centered at the sample mean.





5.1. Sorting the eigenvectors by decreasing eigenvalues

We started with the goal to reduce the dimensionality of our feature space, i.e., projecting the feature space via PCA onto a smaller subspace, where the eigenvectors will form the axes of this new feature subspace. However, the eigenvectors only define the directions of the new axes, since they all have the same unit length 1, which we can confirm by the following code:

So, in order to decide which eigenvector(s) we want to drop for our lower-dimensional subspace, we have to take a look at the corresponding eigenvalues of the eigenvectors. Roughly speaking, the eigenvectors with the lowest eigenvalues bear the least information about the distribution of the data, and those are the ones we want to drop.
The common approach is to rank the eigenvectors from highest to lowest corresponding eigenvalue and choose the top k eigenvectors.



5.2. Choosing k eigenvectors with the largest eigenvalues

For our simple example, where we are reducing a 3-dimensional feature space to a 2-dimensional feature subspace, we are combining the two eigenvectors with the highest eigenvalues to construct our d×k-dimensional eigenvector matrix WW.



6. Transforming the samples onto the new subspace

In the last step, we use the $3 \times 2$-dimensional matrix $W$ that we just computed to transform our samples onto the new subspace via the equation $y = W^T \times x$.



Using the PCA() class from the sklearn.decomposition library to confirm our results

In order to make sure that we have not made a mistake in our step by step approach, we will use another library that doesn't rescale the input data by default.
Here, we will use the PCA class from the scikit-learn machine-learning library. The documentation can be found here:
http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html.

For our convenience, we can directly specify to how many components we want to reduce our input dataset via the n_components parameter.

Next, we just need to use the .fit_transform() method in order to perform the dimensionality reduction.

Depending on your computing environment, you may find that the plot above is the exact mirror image of the plot from our step-by-step approach. This is due to the fact that the signs of the eigenvectors can be either positive or negative, since the eigenvectors are scaled to unit length 1; we can simply multiply the transformed data by $(-1)$ to revert the mirror image.

Please note that this is not an issue: if $v$ is an eigenvector of a matrix $\Sigma$, we have

$$\Sigma v = \lambda v,$$

where $\lambda$ is our eigenvalue. Then $-v$ is also an eigenvector, with the same eigenvalue, since

$$\Sigma(-v) = -\Sigma v = -\lambda v = \lambda(-v).$$

Also, see the note in the scikit-learn documentation:

Due to implementation subtleties of the Singular Value Decomposition (SVD), which is used in this implementation, running fit twice on the same matrix can lead to principal components with signs flipped (change in direction). For this reason, it is important to always use the same estimator object to transform data in a consistent fashion.

(http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html)

Machine Learning
MSE FTP MachLe
Christoph Würsch

V11 Dimensionality Reduction (TSM_MachLE)

Comparison of different methods

based on a Jupyter notebook by Aurelien Géron

Setup

First, let's make sure this notebook works well in both python 2 and 3, import a few common modules, ensure MatplotLib plots figures inline and prepare a function to save the figures:

Generate a 3D dataset:

PCA using Scikit-Learn

With Scikit-Learn, PCA is really trivial. It even takes care of mean centering for you:

Of course, there was some loss of information during the projection step, so the recovered 3D points are not exactly equal to the original 3D points:

We can compute the reconstruction error:

The PCA object gives access to the principal components that it computed:

Now let's look at the explained variance ratio:

The first dimension explains 84.2% of the variance, while the second explains 14.6%.

By projecting down to 2D, we lost about 1.1% of the variance:

Next, let's generate some nice figures! :)

Utility class to draw 3D arrows (copied from http://stackoverflow.com/questions/11140163)

Express the plane as a function of x and y.

Plot the 3D dataset, the plane and the projections on that plane.

Manifold learning

Swiss roll:

PCA

Kernel PCA

LLE

MDS, Isomap and t-SNE


Lab 11, A0 Principal Component Analysis

Up until now, we have been looking in depth at supervised learning estimators: those estimators that predict labels based on labeled training data. Here we begin looking at several unsupervised estimators, which can highlight interesting aspects of the data without reference to any known labels.

In this section, we explore what is perhaps one of the most broadly used of unsupervised algorithms, principal component analysis (PCA). PCA is fundamentally a dimensionality reduction algorithm, but it can also be useful as a tool for visualization, for noise filtering, for feature extraction and engineering, and much more. After a brief conceptual discussion of the PCA algorithm, we will see a couple examples of these further applications.

We begin with the standard imports:

Introducing Principal Component Analysis

Principal component analysis is a fast and flexible unsupervised method for dimensionality reduction in data, which we saw briefly in Introducing Scikit-Learn. Its behavior is easiest to visualize by looking at a two-dimensional dataset. Consider the following 200 points:

By eye, it is clear that there is a nearly linear relationship between the x and y variables. This is reminiscent of the linear regression data, but the problem setting here is slightly different: rather than attempting to predict the y values from the x values, the unsupervised learning problem attempts to learn about the relationship between the x and y values.

In principal component analysis, this relationship is quantified by finding a list of the principal axes in the data, and using those axes to describe the dataset. Using Scikit-Learn's PCA estimator, we can compute this as follows:

The fit learns some quantities from the data, most importantly the "components" and "explained variance":

To see what these numbers mean, let's visualize them as vectors over the input data, using the "components" to define the direction of the vector, and the "explained variance" to define the squared-length of the vector:

These vectors represent the principal axes of the data, and the length of the vector is an indication of how "important" that axis is in describing the distribution of the data—more precisely, it is a measure of the variance of the data when projected onto that axis. The projection of each data point onto the principal axes are the "principal components" of the data.
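A sketch of this visualization on a hypothetical correlated 2D dataset (the mixing matrix and the factor 3 on the arrow length are arbitrary choices for illustration):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.RandomState(1)
# 200 correlated 2D points, similar in spirit to the example in the text.
X = rng.randn(200, 2) @ np.array([[2.0, 1.0], [0.6, 0.6]])

pca = PCA(n_components=2).fit(X)

fig, ax = plt.subplots()
ax.scatter(X[:, 0], X[:, 1], alpha=0.3)
for length, vector in zip(pca.explained_variance_, pca.components_):
    # scale each principal axis by the std. dev. along it (sqrt of the variance)
    tip = pca.mean_ + 3 * np.sqrt(length) * vector
    ax.annotate("", tip, pca.mean_,
                arrowprops=dict(arrowstyle="->", linewidth=2))
ax.set_aspect("equal")
```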

If we plot these principal components beside the original data, we see the plots shown here:

This transformation from data axes to principal axes is an affine transformation, which basically means it is composed of a translation, rotation, and uniform scaling.

While this algorithm to find principal components may seem like just a mathematical curiosity, it turns out to have very far-reaching applications in the world of machine learning and data exploration.

PCA as dimensionality reduction

Using PCA for dimensionality reduction involves zeroing out one or more of the smallest principal components, resulting in a lower-dimensional projection of the data that preserves the maximal data variance.

Here is an example of using PCA as a dimensionality reduction transform:

The transformed data has been reduced to a single dimension. To understand the effect of this dimensionality reduction, we can perform the inverse transform of this reduced data and plot it along with the original data:

The light points are the original data, while the dark points are the projected version. This makes clear what a PCA dimensionality reduction means: the information along the least important principal axis or axes is removed, leaving only the component(s) of the data with the highest variance. The fraction of variance that is cut out (proportional to the spread of points about the line formed in this figure) is roughly a measure of how much "information" is discarded in this reduction of dimensionality.

This reduced-dimension dataset is in some senses "good enough" to encode the most important relationships between the points: despite reducing the dimension of the data by 50%, the overall relationship between the data points is mostly preserved.

Principal Component Analysis Summary

In this section we have discussed the use of principal component analysis for dimensionality reduction, for visualization of high-dimensional data, for noise filtering, and for feature selection within high-dimensional data. Because of the versatility and interpretability of PCA, it has been shown to be effective in a wide variety of contexts and disciplines. Given any high-dimensional dataset, I tend to start with PCA in order to visualize the relationship between points (as we did with the digits), to understand the main variance in the data (as we did with the eigenfaces), and to understand the intrinsic dimensionality (by plotting the explained variance ratio). Certainly PCA is not useful for every high-dimensional dataset, but it offers a straightforward and efficient path to gaining insight into high-dimensional data.

PCA's main weakness is that it tends to be highly affected by outliers in the data. For this reason, many robust variants of PCA have been developed, many of which act to iteratively discard data points that are poorly described by the initial components. Scikit-Learn contains a couple interesting variants on PCA, including RandomizedPCA and SparsePCA, both also in the sklearn.decomposition submodule. RandomizedPCA, which we saw earlier, uses a non-deterministic method to quickly approximate the first few principal components in very high-dimensional data, while SparsePCA introduces a regularization term that serves to enforce sparsity of the components.

In the following sections, we will look at other unsupervised learning methods that build on some of the ideas of PCA.


Lab11, A4 Necessity of Feature-Scaling for the PCA

Feature scaling through standardization (or Z score normalization) can be an important pre-processing step for many machine learning processes. Since many algorithms (such as SVM, K-nearest neighbors, and logistic regression) require the normalization of features, we can analyze the importance of scaling data using the example of Principal Component Analysis (PCA).

In PCA, we are interested in the components that maximize variance. If one feature (e.g., height in meters) varies less than another (e.g., weight in kilograms) merely because of the units used, PCA could conclude, on unscaled features, that the direction of maximum variance corresponds to weight rather than height. Yet a change in height of one meter is arguably much more important than a change in weight of one kilogram, so this assignment would clearly be wrong.

To illustrate this, we now go through a PCA by scaling the data with the class StandardScaler from the module sklearn.preprocessing. The results are visualized and compared with the results of unscaled data. We will notice a clearer difference when using standardization. The data set used is the wine data set available from UCI. This data set has continuous features that are heterogeneous due to the different magnitudes of the characteristics they measure (e.g. alcohol content and malic acid).

The transformed data is then used to train a naive Bayes classifier. Significant differences in predictive accuracy can be observed, with the pipeline that scales the data before applying PCA far outperforming the unscaled version.

(a) Import of the used classes

(b) Split in training and test dataset

(c) Create a pipeline

We use a Pipeline to perform a PCA with two principal components and then train a naive Bayes classifier. Then we look at the accuracy on the test data. We do this first without scaling the features.
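A sketch of the two pipelines, using scikit-learn's built-in copy of the UCI wine data; the split parameters (test_size=0.3, random_state=42) are assumptions for illustration:

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.naive_bayes import GaussianNB

X, y = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

# Pipeline without scaling: PCA is dominated by the largest-magnitude feature.
raw = make_pipeline(PCA(n_components=2), GaussianNB()).fit(X_tr, y_tr)

# Same pipeline with standardization before PCA.
scaled = make_pipeline(StandardScaler(), PCA(n_components=2),
                       GaussianNB()).fit(X_tr, y_tr)

print("unscaled:", raw.score(X_te, y_te))
print("scaled:  ", scaled.score(X_te, y_te))
```

The scaled pipeline should score noticeably higher, since without scaling the first two principal components are dominated by the features with the largest numeric range.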

(d) PCA using 4 principal components (without standardization / scaling of the data)

(f) New pipeline with scaled features

The features are now scaled for comparison.

(g) Prediction accuracy for the scaled data (using two principal components)

(h) Plotting the principal components

Now we extract the principal components, once for the unscaled and once for the scaled case.

Looking at the first principal component in the unscaled case, you can see that feature #13 dominates its direction, because this feature is several orders of magnitude larger than the others. This contrasts with the first principal component of the scaled version, where all features contribute with roughly the same order of magnitude.


Lab11, A5 – Dimensionality Reduction

Exercise 5: Stochastic Neighbour Embedding on the MNIST dataset

Setup

First, let's make sure this notebook works well in both python 2 and 3, import a few common modules, ensure MatplotLib plots figures inline and prepare a function to save the figures:

(a) Using t-SNE to reduce the dimensionality to two dimensions

Exercise: Use t-SNE to reduce the MNIST dataset down to two dimensions and plot the result using Matplotlib. You can use a scatterplot using 10 different colors to represent each image's target class.

Let's start by loading the MNIST dataset:

Dimensionality reduction on the full 60,000 images takes a very long time, so let's only do this on a random subset of 5,000 images:

Now let's use t-SNE to reduce dimensionality down to 2D so we can plot the dataset (This can take quite a while...):

Now let's use Matplotlib's scatter() function to plot a scatterplot, using a different color for each digit:
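A minimal sketch of these two steps, substituting scikit-learn's small 8×8 digits set (and a 500-image subsample) for MNIST to keep the runtime short:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")           # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# Small 8x8 digits set as a fast stand-in for the 5,000-image MNIST sample.
X, y = load_digits(return_X_y=True)
idx = np.random.RandomState(0).choice(len(X), 500, replace=False)

# Reduce to 2D with t-SNE (stochastic: the layout varies with random_state).
X_2d = TSNE(n_components=2, random_state=42).fit_transform(X[idx])

plt.figure(figsize=(8, 6))
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y[idx], cmap="jet", s=10)
plt.colorbar()
```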

Isn't this just beautiful? :) This plot tells us which numbers are easily distinguishable from the others (e.g., 0s, 6s, and most 8s are rather well separated clusters), and it also tells us which numbers are often hard to distinguish (e.g., 4s and 9s, 5s and 3s, and so on).

(b) Labelling the Clusters

Let's focus on digits 2, 3 and 5, which seem to overlap a lot.

t-SNE instead of PCA

Let's see if we can produce a nicer image by running t-SNE on these 3 digits:

Much better, now the clusters have far less overlap. But some 3s are all over the place. Plus, there are two distinct clusters of 2s, and also two distinct clusters of 5s. It would be nice if we could visualize a few digits from each cluster, to understand why this is the case. Let's do that now.

Exercise: Alternatively, you can write colored digits at the location of each instance, or even plot scaled-down versions of the digit images themselves (if you plot all digits, the visualization will be too cluttered, so you should either draw a random sample or plot an instance only if no other instance has already been plotted at a close distance). You should get a nice visualization with well-separated clusters of digits.

Let's create a plot_digits() function that will draw a scatterplot (similar to the above scatterplots) plus write colored digits, with a minimum distance guaranteed between these digits. If the digit images are provided, they are plotted instead. This implementation was inspired from one of Scikit-Learn's excellent examples (plot_lle_digits, based on a different digit dataset).

Let's try it! First let's just write colored digits:

Well that's okay, but not that beautiful. Let's try with the digit images:

Exercise: Try using other dimensionality reduction algorithms such as PCA, LLE, or MDS and compare the resulting visualizations.

Let's start with PCA. We will also time how long it takes:

Wow, PCA is blazingly fast! But although we do see a few clusters, there's way too much overlap. Let's try LLE:

(c) Using other dimensionality reduction algorithms such as LLE, MDS and t-SNE

First we start with LLE, i.e. Locally Linear Embedding, which is part of the manifold module of sklearn.

That took a while, and the result does not look too good. Let's see what happens if we apply PCA first, preserving 95% of the variance:

The result is more or less the same, but this time it was almost 4× faster.
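The PCA-then-LLE chain can be sketched as a pipeline; here the small digits dataset stands in for the MNIST sample, and the LLE parameters are left at their defaults:

```python
import time
from sklearn.datasets import load_digits
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.manifold import LocallyLinearEmbedding

X, _ = load_digits(return_X_y=True)   # stand-in for the MNIST sample

t0 = time.time()
pca_lle = make_pipeline(
    PCA(n_components=0.95, random_state=42),   # keep 95% of the variance first
    LocallyLinearEmbedding(n_components=2, random_state=42),
)
X_2d = pca_lle.fit_transform(X)
print("PCA+LLE took {:.1f}s".format(time.time() - t0))
```

Because PCA removes the near-zero-variance directions first, LLE has far fewer input dimensions to work with, which is where the speedup comes from.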

Let's try MDS. It's much too long if we run it on 10,000 instances, so let's just try 2,000 for now:

Meh. This does not look great, all clusters overlap too much. Let's try with PCA first, perhaps it will be faster?

Same result, and no speedup: PCA did not help (or hurt).

Let's try LDA:

This one is very fast, and it looks nice at first, until you realize that several clusters overlap severely.

Well, it's pretty clear that t-SNE won this little competition, wouldn't you agree? We did not time it, so let's do that now:

It's twice as slow as LLE, but still much faster than MDS, and the result looks great. Let's see if a bit of PCA can speed it up:

Yes, PCA roughly gave us a 25% speedup, without damaging the result. We have a winner!


Lab 11, A6: Feature Engineering using PCA

Setup

First, let's make sure this notebook works well in both python 2 and 3, import a few common modules, ensure MatplotLib plots figures inline and prepare a function to save the figures:

(a) Loading the MNIST Dataset

Exercise: Load the MNIST dataset (introduced in chapter 3) and split it into a training set and a test set (take the first 60,000 instances for training, and the remaining 10,000 for testing).

(b) Training a Random Forest classifier on the dataset

Exercise: Train a Random Forest classifier on the dataset and time how long it takes, then evaluate the resulting model on the test set.

(c) Use PCA to reduce the dataset’s dimensionality, with an explained variance ratio of 95%

Exercise: Next, use PCA to reduce the dataset's dimensionality, with an explained variance ratio of 95%.
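One way to do this is to pass a float to PCA's n_components. A sketch on the small digits set (a stand-in for the 784-dimensional MNIST images):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, y = load_digits(return_X_y=True)   # stand-in for the MNIST images

# A float in (0, 1) tells PCA to keep just enough components
# to explain that fraction of the total variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print(pca.n_components_, "components explain",
      pca.explained_variance_ratio_.sum(), "of the variance")
```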

Exercise: Train a new Random Forest classifier on the reduced dataset and see how long it takes. Was training much faster?

Oh no! Training is actually more than twice as slow now! How can that be? Well, as we saw in this chapter, dimensionality reduction does not always lead to faster training: it depends on the dataset, the model and the training algorithm. See figure 8-6 (the manifold_decision_boundary_plot* plots above). If you try a softmax classifier instead of a random forest classifier, you will find that training time is reduced by a factor of 3 when using PCA. Actually, we will do this in a second, but first let's check the accuracy of the new random forest classifier.

(d) Evaluate the classifier on the test set: how does it compare to the previous classifier?

It is common for performance to drop slightly when reducing dimensionality, because we do lose some useful signal in the process. However, the performance drop is rather severe in this case. So PCA really did not help: it slowed down training and reduced performance. :(

Let's see if it helps when using softmax regression:

Okay, so softmax regression takes much longer to train on this dataset than the random forest classifier, plus it performs worse on the test set. But that's not what we are interested in right now, we want to see how much PCA can help softmax regression. Let's train the softmax regression model using the reduced dataset:

Nice! Reducing dimensionality led to a 4× speedup. :) Let's check the model's accuracy:

A very slight drop in performance, which might be a reasonable price to pay for a 4× speedup, depending on the application.

So there you have it: PCA can give you a formidable speedup... but not always!


Lab12: A2_Elbow Curve and sklearn.cluster.KMeans

Based on Python Machine Learning 2nd Edition by Sebastian Raschka, Packt Publishing Ltd. 2017

Grouping objects by similarity using k-means

K-means clustering using scikit-learn

Using the following code lines, you can generate two-dimensional data clusters that can be used for testing clustering algorithms. In this exercise, you will learn how to apply k-means and how to determine the optimal number of clusters using the elbow criterion on the inertia plot.

(a) Scatterplot

Generate a distribution of 8 clusters with 250 samples and plot them as a scatterplot. How many clusters do you recognize by eye? Try to change the cluster standard deviation cluster_std until it becomes hard for you to discriminate the 8 different clusters.

(b) KMeans

Import the method KMeans from sklearn.cluster. Instantiate a model km with 8 clusters (n_clusters=8). Set the maximum number of iterations to max_iter=300 and n_init=10.

Fit the model to the data and predict the cluster label using km.fit_predict(X). Hint: One way to deal with convergence problems is to choose larger values for tol, which is a parameter that controls the tolerance with regard to the changes in the within-cluster sum-squared-error to declare convergence. Try a tolerance of 1e-04.

(c) Display the clustered data

Use the function PlotClusters to display the clustered data.

(d) Variation of the number k of clusters

Vary the number of clusters n_clusters=8 in your KMeans clustering algorithm from 4 to 8 and display each time the result using the function PlotClusters.


Using the elbow method to find the optimal number of clusters

(e) Elbow Method

In a for loop, vary the number of clusters from n_clusters=8 to n_clusters=15 and cluster the data each time using the km.fit_predict method. Read out the inertia km.inertia_ and store it, as a function of the number of clusters, in a list called distortions using the append method. Plot the inertia as a function of the number of clusters and determine the optimal number of clusters from the elbow curve.
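The elbow loop can be sketched as follows; here the blob data from part (a) is regenerated, and k is swept over the wider range 1 to 15 (an assumption) so that the bend at the true cluster count is visible:

```python
import matplotlib
matplotlib.use("Agg")           # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=250, centers=8, cluster_std=0.8, random_state=0)

distortions = []
ks = range(1, 16)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, max_iter=300,
                tol=1e-4, random_state=0)
    km.fit_predict(X)
    distortions.append(km.inertia_)   # within-cluster sum of squared errors

plt.plot(list(ks), distortions, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("inertia")
```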

(f) kMeans++

Without an explicit setting, a random seed is used to place the initial centroids, which can sometimes result in bad clusterings or slow convergence. Another strategy is to place the initial centroids far away from each other via the k-means++ algorithm, which leads to better and more consistent results than classic k-means. This can be selected in sklearn.cluster.KMeans by setting init='k-means++'.


Bonus

Quantifying the quality of clustering via silhouette plots

Comparison to "bad" clustering:



Organizing clusters as a hierarchical tree

Grouping clusters in bottom-up fashion


Performing hierarchical clustering on a distance matrix

We can either pass a condensed distance matrix (upper triangular) from the pdist function, or we can pass the "original" data array and define the metric='euclidean' argument in linkage. However, we should not pass the squareform distance matrix, which would yield different distance values although the overall clustering could be the same.


Attaching dendrograms to a heat map


Applying agglomerative clustering via scikit-learn



Locating regions of high density via DBSCAN

K-means and hierarchical clustering:

Density-based clustering:




Lab 12: A3 k-Means, Gaussian Mixture Models and the EM algorithm

To gain an understanding of mixture models, we have to start at the beginning with the expectation maximization (EM) algorithm and its applications.

First a little history on the EM-algorithm

Reference: 4

Dempster, Laird & Rubin (1977):

- unified previously unrelated work under "The EM Algorithm"
- overlooked earlier E-M works -- note the gaps between foundational authors:
    - Newcomb (1887)
    - McKendrick (1926) [+39 years]
    - Hartley (1958) [+32 years]
    - Baum et al. (1970) [+12 years]
    - Dempster et al. (1977) [+7 years]

EM Algorithm developed over 90 years

EM provides general framework for solving problems

Examples include:

- Filling in missing data from a sample set
- Discovering values of latent variables
- Estimating parameters of HMMs
- Estimating parameters of finite mixtures [models]
- Unsupervised learning of clusters
- etc...

(a) How does the EM algorithm work?

EM is an iterative process that begins with a "naive" or random initialization and then alternates between the expectation and maximization steps until the algorithm reaches convergence.

To describe this in words imagine we have a simple data set consisting of class heights with groups separated by gender.

Now imagine that we did not have the convenient gender labels associated with each data point. How could we estimate the two group means?

First let's set up our problem.

In this example we hypothesize that these height data points are drawn from two distributions with two means, μ1 and μ2.

For this example, we will assume the x values are drawn from Gaussian distributions.

To start the algorithm, we choose two random means.

From there we repeat the following until convergence.

The expectation step:

We calculate the expected values E(zij), which is the probability that xi was drawn from the jth distribution.

$$E(z_{ij}) \;=\; \frac{p(x = x_i \mid \mu = \mu_j)}{\sum_{n=1}^{2} p(x = x_i \mid \mu = \mu_n)} \;=\; \frac{e^{-\frac{1}{2\sigma^2}(x_i - \mu_j)^2}}{\sum_{n=1}^{2} e^{-\frac{1}{2\sigma^2}(x_i - \mu_n)^2}}$$

The formula simply states that the expected value of zij is the probability of xi given μj, divided by the sum of the probabilities that xi was drawn from each μn.

The maximization step:

After calculating all E(zij) values we can calculate (update) new μ values.

$$\mu_j = \frac{\sum_{i=1}^{m} E(z_{ij})\, x_i}{\sum_{i=1}^{m} E(z_{ij})}$$

This formula generates the maximum likelihood estimate.

By repeating the E-step and M-step we are guaranteed to find a local maximum giving us a maximum likelihood estimation of our hypothesis.
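The two steps above can be sketched in a few lines of numpy. The height data, the shared variance σ² = 25, and the starting means are all hypothetical values chosen for illustration:

```python
import numpy as np

rng = np.random.RandomState(0)
# Two height groups with (hypothetical) true means 165 and 180.
x = np.concatenate([rng.normal(165, 5, 100), rng.normal(180, 5, 100)])

sigma2 = 25.0                  # assume a known, shared variance
mu = np.array([150.0, 200.0])  # two arbitrary starting means

for _ in range(50):
    # E-step: responsibility E(z_ij) of each mean for each point
    p = np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2 * sigma2))
    E = p / p.sum(axis=1, keepdims=True)
    # M-step: update each mean as the responsibility-weighted average
    mu = (E * x[:, None]).sum(axis=0) / E.sum(axis=0)

print(sorted(mu))   # close to the true group means
```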

What are Maximum Likelihood Estimates (MLE)

1. Parameters describe characteristics (attributes) of a population. These parameter values are estimated from samples collected from that population.

2. An MLE is a parameter estimate that is most consistent with the sampled data. By definition it maximizes the likelihood function. One way to accomplish this is to take the first derivative of the likelihood function with respect to the parameter θ, set it to zero, and solve. The value that satisfies this equation maximizes the likelihood function and is the MLE.

A quick example of a maximum likelihood estimate

You flip a coin 10 times and observe the following sequence (H, T, T, H, T, T, T, T, H, T)

What's the MLE of the head probability θ after observing 3 heads in 10 trials?

simple answer:

The frequentist MLE is (# of successes) / (# of trials) or 3/10

answer via the first derivative of the binomial likelihood:

$$\begin{aligned}
L(\theta) &= \binom{10}{3}\,\theta^3 (1-\theta)^7 \\
\log L(\theta) &= \log\binom{10}{3} + 3\log\theta + 7\log(1-\theta) \\
\frac{d \log L(\theta)}{d\theta} &= \frac{3}{\theta} - \frac{7}{1-\theta} = 0 \\
\frac{3}{\theta} = \frac{7}{1-\theta} &\;\Rightarrow\; \theta = \frac{3}{10}
\end{aligned}$$

That's an MLE! This is the estimate that is most consistent with the observed data.
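A quick numerical sanity check of this derivation: evaluate the likelihood on a grid of θ values and confirm the maximum sits at 3/10.

```python
import numpy as np
from math import comb

# Likelihood of 3 heads in 10 flips as a function of theta.
theta = np.linspace(0.001, 0.999, 999)   # grid with step 0.001
L = comb(10, 3) * theta**3 * (1 - theta)**7

theta_hat = theta[np.argmax(L)]
print(theta_hat)   # ~0.3, matching the closed-form MLE 3/10
```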

Back to our height example. Using the generalized Gaussian mixture model code sourced from Duke's computational statistics we can visualize this process.

Notice how the algorithm was able to estimate the true means starting from random guesses for the parameters.

Now that we have a grasp of the algorithm we can examine K-Means as a form of EM

K-Means is an unsupervised learning algorithm used for clustering multidimensional data sets.

The basic form of K-Means makes two assumptions

1. Each data point is closer to its own cluster center than the other cluster centers
2. A cluster center is the arithmetic mean of all the points that belong to the cluster.

The expectation step is done by calculating the distance of every data point to each cluster center and assigning cluster membership to the closest center (mean).

The maximization step is simply the arithmetic mean of the previously assigned data points for each cluster

The following sections borrow heavily from Jake Vanderplas' Python Data Science Handbook

To build our intuition of this process, play with the following interactive code from Jake Vanderplas in a Jupyter (IPython) notebook.

Now we are ready to explore some of the nuances/issues of implementing K-Means as an expectation maximization algorithm

the globally optimal result is not guaranteed

- EM is guaranteed to improve the result in each iteration, but there are no guarantees that it will find the global best. See the following example, where we initialize the algorithm with a different seed.

practical solution:

- Run the algorithm w/ multiple random initializations
- This is done by default in sklearn

the number of means (clusters) has to be selected beforehand

- k-means cannot learn the optimal number of clusters from the data. If we ask for six clusters it will find six clusters, which may or may not be meaningful.

practical solution:

- use a more complex clustering algorithm like Gaussian Mixture Models, or one that can choose a suitable number of clusters (DBSCAN, mean-shift, affinity propagation)

k-means is terrible for non-linear data:

- this is a consequence of the assumption that points are closer to their own cluster center than to any other

practical solutions:

- transform the data into a higher-dimensional space where linear separation is possible, e.g., spectral clustering

K-Means is known as a hard clustering algorithm because clusters are not allowed to overlap.

"One way to think about the k-means model is that it places a circle (or, in higher dimensions, a hyper-sphere) at the center of each cluster, with a radius defined by the most distant point in the cluster. This radius acts as a hard cutoff for cluster assignment within the training set: any point outside this circle is not considered a member of the cluster." -- Jake VanderPlas, Python Data Science Handbook

A resulting issue of K-Means' circular boundaries is that it has no way to account for oblong or elliptical clusters.

There are two ways we can extend K-Means

1. measure uncertainty in cluster assignments by comparing distances to all cluster centers
2. allow for flexibility in the shape of the cluster boundaries by using ellipses

Recall our previous height example, and let's assume that each cluster is a Gaussian distribution!

Gaussian distributions give flexibility to the clustering, and the same basic two step E-M algorithm used in K-Means is applied here as well.

  1. Randomly initialize location and shape
  2. Repeat until converged:

    E-step: for each point, find weights encoding the probability of membership in each cluster.
    
    M-step: for each cluster, update its location, normalization, and shape based on all data points, making use of the weights

The result of this process is that we end up with a smooth Gaussian cluster better fitted to the shape of the data, instead of a rigid inflexible circle.

Note that because we still are using the E-M algorithm there is no guarantee of a globally optimal result. We can visualize the results of the model.

Notice how much better the model is able to fit the clusters when we assume each cluster is a Gaussian distribution instead of a circle whose radius is defined by the most distant point.

Gaussian Mixture Models as a tool for Density Estimation

The technical term for this type of model is:

generative probabilistic model

Why you ask?

Because this model is really about characterizing the distribution of the entire dataset and not necessarily clustering. The power of these types of models is that they allow us to generate new samples that mimic the original underlying data!

If we try to cluster this data set we run into the same issue as before.

Instead let's ignore individual clusters and model the whole distribution of data as a collection of many Gaussians.

Looks like the collection of clusters has fit the data set reasonably well. Now let's see if the model has actually learned about this data set, such that we can create entirely new samples that look like the original.

Generative models allow for multiple methods to determine optimal number of components. Because it is a probability distribution we can evaluate the likelihood of the data using cross validation and/or using AIC or BIC.

Sklearn makes this easy.
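A sketch of this model-selection step on synthetic blob data (the number of samples, centers, and the candidate range 1..8 are assumptions for illustration):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.6, random_state=0)

# Fit GMMs with 1..8 components and compare information criteria.
n_components = range(1, 9)
models = [GaussianMixture(n, random_state=0).fit(X) for n in n_components]

aic = [m.aic(X) for m in models]
bic = [m.bic(X) for m in models]
best = n_components[int(np.argmin(bic))]
print("BIC picks", best, "components")
```

Lower AIC/BIC is better; BIC penalizes extra components more strongly, so it tends to pick the more parsimonious model.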


Lab12: A4 Image Compression using kMeans

Clustering can be used to reduce the number of colors in an image: similar colors are assigned to the same cluster label, i.e. the same entry of a color palette. In the following exercise, you will load an image as a [w,h,3] numpy.array of type float64, where w and h are the width and height in pixels, respectively. The last dimension of the three-dimensional array holds the three RGB color channels. Using kMeans, we will reduce the color depth from 24 bits to 64 colors (6 bits) and to 16 colors (4 bits).

(a) Reading an Image

Start by reading in an image from the Python imaging library PIL (https://en.wikipedia.org/wiki/Python_Imaging_Library) in your Jupyter notebook.

(b) Flatten the image

Flatten the image to a [w·h, 3]-dimensional numpy.array and shuffle the pixels using sklearn.utils.shuffle.

(c) Quantization to 64 colors

Create an instance of the kMeans class called estimator. Use the fit method of kMeans to create sixty-four clusters (n_clusters=64) from a sample of one thousand randomly selected colors, e.g. the first 1000 colors of the shuffled pixels. The new color palette is given by the cluster centers that are accessible in estimator.cluster_centers_. Each of the clusters will be a color in the compressed palette.

(d) Prediction of the cluster assignment (labels)

Assign a cluster label to each pixel in the original image using the .predict method of your kMeans instance. Now you know to which color of your reduced palette each pixel belongs, i.e. we predict the cluster assignment for each of the pixels in the original image.

(e) Create a compressed image using only 64 colors

Finally, we create the compressed image from the compressed palette and cluster assignments.

Loop over all pixels and assign the new color palette corresponding to the label of the pixel and create a new, reduced color picture. Plot the images using plt.imshow, compare the original image and the 64 color image. Try the same with 32 and 16 colors.
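Steps (b)-(e) can be sketched end to end as follows; a small synthetic random image stands in for the loaded photo, so no file is needed:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.utils import shuffle

rng = np.random.RandomState(0)
# Synthetic 32x32 RGB image in [0, 1] as a stand-in for a loaded photo.
img = rng.rand(32, 32, 3)

w, h, _ = img.shape
pixels = img.reshape(w * h, 3)                   # flatten to [w*h, 3]

sample = shuffle(pixels, random_state=0)[:1000]  # fit on 1000 shuffled pixels
estimator = KMeans(n_clusters=64, n_init=10, random_state=0).fit(sample)

labels = estimator.predict(pixels)               # palette index per pixel
# Rebuild the image from the 64-color palette and the per-pixel labels.
compressed = estimator.cluster_centers_[labels].reshape(w, h, 3)
print(compressed.shape, len(np.unique(labels)))
```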


ML12 A5 Detecting similar Faces using `DBSCAN`

The labelled faces dataset of scikit-learn contains gray-scale images of 62 different famous personalities from politics. In this exercise, we assume that there are no target labels, i.e. the names of the persons are unknown. We want to find a method to cluster similar images. This can be done using a dimensionality reduction algorithm such as PCA for feature generation, followed by a clustering step, e.g. using DBSCAN.

(a) Loading the Faces Dataset

Open the Jupyter notebook DBSCAN_DetectSimilarFaces.ipynb and have a look at the first few faces of the dataset. Not every person is represented equally frequently in this unbalanced dataset; for classification, we would have to take this into account. We extract the first 50 images of each person and put them into a flat array called X_people. The corresponding targets (y-values, the names) are stored in the y_people array.

(b) Principal Component Analysis

Now apply a principal component analysis, X_pca = pca.fit_transform(X_people), and extract the first 100 components of each image. Reconstruct the first 10 entries of the dataset from the 100 components of the PCA-transformed data by applying the pca.inverse_transform method and reshaping the result to the original image size using np.reshape.

What is the minimum number of components necessary such that you recognize the persons? Try it out.

(c) Apply DBSCAN on these features

Import the DBSCAN class from sklearn.cluster, create an instance called dbscan, apply it to the PCA-transformed data X_pca, and extract the cluster labels using labels = dbscan.fit_predict(X_pca). First use the default parameters of the method and check how many distinct clusters the algorithm finds by counting the number of unique entries in the predicted cluster labels.

(d) Variation of the eps parameter

Change the parameter eps of the DBSCAN instance using DBSCAN(min_samples=3, eps=5). Vary the value of eps in the range from 5 to 10 in steps of 0.5 using a for loop, and check for each value of eps how many clusters are found.
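The eps sweep can be sketched as below; synthetic blob features stand in for the PCA-transformed faces (X_pca in the exercise), so the cluster counts here are only illustrative. Note that DBSCAN labels noise points with -1, which should not be counted as a cluster:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN

# Stand-in features for the PCA-transformed faces (X_pca in the exercise).
X_pca, _ = make_blobs(n_samples=300, centers=6, cluster_std=1.0,
                      random_state=0)

cluster_counts = {}
for eps in np.arange(5.0, 10.5, 0.5):
    labels = DBSCAN(min_samples=3, eps=eps).fit_predict(X_pca)
    # label -1 marks noise, so it does not count as a cluster
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    cluster_counts[round(float(eps), 1)] = n_clusters
    print("eps={:.1f}: {} clusters".format(eps, n_clusters))
```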

(e) Maximum number of clusters found

Select the value of eps for which the number of clusters found is maximal and plot the members of the found clusters using the following Python code.

Bonus: Agglomerative and Spectral Clustering (optional)

Machine Learning
MSE FTP MachLe
Christoph Würsch

k-Means, Gaussian Mixture Models and the EM algorithm

To gain an understanding of mixture models, we have to start at the beginning with the expectation maximization algorithm and its applications.

First, a little history of the EM algorithm.

Reference: 4

Dempster, Laird & Rubin (1977):

- unified previously unrelated work under "The EM Algorithm"
- earlier E-M work was overlooked - note the gaps between the foundational authors:
    - Newcomb (1887)
    - McKendrick (1926) [+39 years]
    - Hartley (1958) [+32 years]
    - Baum et al. (1970) [+12 years]
    - Dempster et al. (1977) [+7 years]

EM Algorithm developed over 90 years

EM provides general framework for solving problems

Examples include:

- Filling in missing data from a sample set
- Discovering values of latent variables
- Estimating parameters of HMMs
- Estimating parameters of finite mixtures [models]
- Unsupervised learning of clusters
- etc...

(a) How does the EM algorithm work?

EM is an iterative process that begins with a "naive" or random initialization and then alternates between the expectation and maximization steps until the algorithm reaches convergence.

To describe this in words, imagine we have a simple data set consisting of class heights, with the groups separated by gender.

Now imagine that we did not have the convenient gender labels associated with each data point. How could we estimate the two group means?

First let's set up our problem.

In this example, we hypothesize that these height data points are drawn from two distributions with two means, μ1 and μ2.

For this example, we will assume the x values are drawn from Gaussian distributions.

To start the algorithm, we choose two random means.

From there we repeat the following until convergence.

The expectation step:

We calculate the expected values E(zij), each of which is the probability that xi was drawn from the jth distribution.

$$E(z_{ij}) = \frac{p(x=x_i \mid \mu=\mu_j)}{\sum_{n=1}^{2} p(x=x_i \mid \mu=\mu_n)} = \frac{e^{-\frac{1}{2\sigma^2}(x_i-\mu_j)^2}}{\sum_{n=1}^{2} e^{-\frac{1}{2\sigma^2}(x_i-\mu_n)^2}}$$

The formula simply states that the expected value for zij is the probability of xi given μj, divided by the sum of the probabilities that xi belonged to each μn.

The maximization step:

After calculating all E(zij) values we can calculate (update) new μ values.

$$\mu_j = \frac{\sum_{i=1}^{m} E(z_{ij})\, x_i}{\sum_{i=1}^{m} E(z_{ij})}$$

This formula generates the maximum likelihood estimate.

By repeating the E-step and M-step we are guaranteed to find a local maximum giving us a maximum likelihood estimation of our hypothesis.
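The two alternating steps above can be sketched in a few lines of NumPy for the heights example. The data, initial means, and known variance below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(42)
# Toy "class heights" (cm): two gender groups, labels then discarded
x = np.concatenate([rng.normal(165, 5, 100), rng.normal(180, 5, 100)])

sigma2 = 25.0                   # variance assumed known and equal
mu = np.array([150.0, 190.0])   # two (random) initial means

for _ in range(50):
    # E-step: E(z_ij), the probability that x_i came from component j
    dens = np.exp(-0.5 * (x[:, None] - mu[None, :]) ** 2 / sigma2)
    E_z = dens / dens.sum(axis=1, keepdims=True)
    # M-step: update each mean as the E(z_ij)-weighted average of the data
    mu = (E_z * x[:, None]).sum(axis=0) / E_z.sum(axis=0)

print(mu)   # converges near the true group means (165, 180)
```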

What are Maximum Likelihood Estimates (MLEs)?

1. Parameters describe characteristics (attributes) of a population. These parameter values are estimated from samples collected from that population.

2. An MLE is a parameter estimate that is most consistent with the sampled data. By definition, it maximizes the likelihood function. One way to find it is to take the first derivative of the likelihood function with respect to the parameter θ, set it equal to zero, and solve. The resulting value maximizes the likelihood function and is the MLE.

A quick example of a maximum likelihood estimate

You flip a coin 10 times and observe the following sequence (H, T, T, H, T, T, T, T, H, T)

What is the MLE of the probability of heads, given 3 heads in 10 trials?

simple answer:

The frequentist MLE is (# of successes) / (# of trials) or 3/10

answer via the first derivative of the binomial likelihood:

$$\begin{aligned}
L(\theta) &= \binom{10}{3}\,\theta^3 (1-\theta)^7 \\
\log L(\theta) &= \log\binom{10}{3} + 3\log\theta + 7\log(1-\theta) \\
\frac{d \log L(\theta)}{d\theta} &= \frac{3}{\theta} - \frac{7}{1-\theta} = 0 \\
\frac{3}{\theta} &= \frac{7}{1-\theta} \;\Rightarrow\; \theta = \frac{3}{10}
\end{aligned}$$

That's an MLE! This is the estimate that is most consistent with the observed data.
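The analytic result can be double-checked numerically by evaluating the binomial likelihood on a grid of θ values:

```python
import numpy as np
from math import comb

# Binomial likelihood of 3 heads in 10 flips, evaluated on a grid of theta
theta = np.linspace(0.01, 0.99, 99)
L = comb(10, 3) * theta**3 * (1 - theta)**7
theta_hat = theta[np.argmax(L)]
print(theta_hat)   # close to the analytic MLE 3/10
```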

Back to our height example. Using the generalized Gaussian mixture model code sourced from Duke's computational statistics we can visualize this process.

Notice how the algorithm was able to estimate the true means starting from random guesses for the parameters.

Now that we have a grasp of the algorithm we can examine K-Means as a form of EM

K-Means is an unsupervised learning algorithm used for clustering multidimensional data sets.

The basic form of K-Means makes two assumptions

1. Each data point is closer to its own cluster center than to the other cluster centers
2. A cluster center is the arithmetic mean of all the points that belong to the cluster.

The expectation step is done by calculating the distance of every data point to every cluster center and assigning each point to the closest center (mean).

The maximization step simply computes the arithmetic mean of the previously assigned data points for each cluster.
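These two steps can be written directly in NumPy. The toy blob data, k, and iteration count below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: three 2D blobs (centers chosen for illustration)
X = np.concatenate([rng.normal(c, 0.5, size=(100, 2))
                    for c in ([0, 0], [5, 0], [0, 5])])

k = 3
centers = X[rng.choice(len(X), k, replace=False)]   # random initialization

for _ in range(20):
    # E-step: assign each point to its nearest center
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    # M-step: move each center to the arithmetic mean of its points
    # (keep the old center if a cluster happens to become empty)
    centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
```

In practice, `sklearn.cluster.KMeans` performs this same alternation internally, with extra safeguards such as multiple restarts.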

The following sections borrow heavily from Jake Vanderplas' Python Data Science Handbook

To build our intuition of this process, play with the following interactive code from Jake Vanderplas in a Jupyter (IPython) notebook.

Now we are ready to explore some of the nuances/issues of implementing K-Means as an expectation maximization algorithm

the globally optimal result is not guaranteed

- EM is guaranteed to improve the result in each iteration, but there are no guarantees that it will find the global best. See the following example, where we initialize the algorithm with a different seed.

practical solution:

- Run the algorithm with multiple random initializations
- This is done by default in sklearn

the number of means (clusters) has to be selected beforehand

- k-means cannot learn the optimal number of clusters from the data. If we ask for six clusters it will find six clusters, which may or may not be meaningful.

practical solution:

- use a more complex clustering algorithm like Gaussian Mixture Models, or one that can choose a suitable number of clusters (DBSCAN, mean-shift, affinity propagation)

k-means is terrible for non-linear data:

- this results from the assumption that points are closer to their own cluster center than to the others

practical solutions:

- transform the data into a higher dimension where linear separation is possible, e.g. spectral clustering

K-Means is known as a hard clustering algorithm because clusters are not allowed to overlap.

"One way to think about the k-means model is that it places a circle (or, in higher dimensions, a hyper-sphere) at the center of each cluster, with a radius defined by the most distant point in the cluster. This radius acts as a hard cutoff for cluster assignment within the training set: any point outside this circle is not considered a member of the cluster." -- Jake VanderPlas, Python Data Science Handbook

A resulting issue of K-Means' circular boundaries is that it has no way to account for oblong or elliptical clusters.

There are two ways we can extend K-Means

1. measure uncertainty in cluster assignments by comparing distances to all cluster centers
2. allow for flexibility in the shape of the cluster boundaries by using ellipses

Recall our previous height example, and let's assume that each cluster is a Gaussian distribution!

Gaussian distributions give flexibility to the clustering, and the same basic two step E-M algorithm used in K-Means is applied here as well.

  1. Randomly initialize location and shape
  2. Repeat until converged:

    E-step: for each point, find weights encoding the probability of membership in each cluster.
    
    M-step: for each cluster, update its location, normalization, and shape based on all data points, making use of the weights
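In scikit-learn, these two steps run inside GaussianMixture.fit. A minimal sketch on toy blob data (data shape and parameters are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Two blobs with different spreads, so circular clusters would fit poorly
X, _ = make_blobs(n_samples=400, centers=2, cluster_std=[1.0, 2.5],
                  random_state=0)

gmm = GaussianMixture(n_components=2, covariance_type='full', random_state=0)
gmm.fit(X)                    # the E-M iterations happen inside fit
probs = gmm.predict_proba(X)  # soft membership weights: the E-step output
```

Each row of `probs` sums to one, so points between the two Gaussians get split memberships instead of a hard assignment.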

The result of this process is that we end up with a smooth Gaussian cluster better fitted to the shape of the data, instead of a rigid inflexible circle.

Note that because we still are using the E-M algorithm there is no guarantee of a globally optimal result. We can visualize the results of the model.

Notice how much better the model is able to fit the clusters when we assume each cluster is a Gaussian distribution instead of a circle whose radius is defined by the most distant point.

Gaussian Mixture Models as a tool for Density Estimation

The technical term for this type of model is:

generative probabilistic model

Why, you ask?

Because this model is really about characterizing the distribution of the entire dataset and not necessarily clustering. The power of these types of models is that they allow us to generate new samples that mimic the original underlying data!

If we try to cluster this data set we run into the same issue as before.

Instead let's ignore individual clusters and model the whole distribution of data as a collection of many Gaussians.

Looks like the collection of clusters has fit the data set reasonably well. Now let's see if the model has actually learned about this data set, such that we can create entirely new samples that look like the original.

Generative models allow for multiple methods to determine the optimal number of components. Because the model defines a probability distribution, we can evaluate the likelihood of the data using cross-validation and/or information criteria such as the AIC or BIC.

Sklearn makes this easy.
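A sketch of BIC-based model selection with GaussianMixture; the toy data and the range of component counts are assumptions:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Toy data: three blobs
X, _ = make_blobs(n_samples=500, centers=3, cluster_std=1.0, random_state=0)

# Fit mixtures with 1..7 components and score each with the BIC
n_range = range(1, 8)
bics = [GaussianMixture(n_components=n, random_state=0).fit(X).bic(X)
        for n in n_range]
best = list(n_range)[int(np.argmin(bics))]
print("BIC-optimal number of components:", best)
```

The same pattern works with `.aic(X)`; both criteria penalize model complexity, so the minimum marks a trade-off between fit and parsimony.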